Cleaning Data & The Skies¶

Image source - Smog hangs over Los Angeles, California


📖 Background¶

You are a data analyst at an environmental company. Your task is to evaluate ozone pollution across various regions.

You’ve obtained data from the U.S. Environmental Protection Agency (EPA), containing daily ozone measurements at monitoring stations across California. However, like many real-world datasets, it’s far from clean: there are missing values, inconsistent formats, potential duplicates, and outliers.

Before you can provide meaningful insights, you must clean and validate the data. Only then can you analyze it to uncover trends, identify high-risk regions, and assess where policy interventions are most urgently needed.

💾 The Data¶

The data is a modified dataset from the U.S. Environmental Protection Agency (EPA). The ozone.csv file contains the daily air quality summary statistics by monitor for the state of California for 2024. Each row contains the date and the air quality metrics per collection method and site.

  • "Date" - the calendar date with which the air quality values are associated
  • "Source" - the data source: EPA's Air Quality System (AQS) or AirNow reports
  • "Site ID" - the id for the air monitoring site
  • "POC" - the id number for the monitor (Parameter Occurrence Code)
  • "Daily Max 8-hour Ozone Concentration" - the highest 8-hour value of the day for ozone concentration
  • "Units" - parts per million by volume (ppm)
  • "Daily AQI Value" - the highest air quality index value for the day, telling how clean or polluted the air is (a value of 50 represents good air quality, while a value above 300 is hazardous). For more information, reference AirNow.
  • "Local Site Name" - name of the monitoring site
  • "Daily Obs Count" - number of observations reported in that day
  • "Percent Complete" - indicates whether all expected samples were collected
  • "Method Code" - identifier for the collection method
  • "CBSA Code" - identifier for the core-based statistical area (CBSA)
  • "CBSA Name" - name of the core-based statistical area
  • "State FIPS Code" - identifier for the state
    • This variable is not in the original dataset, but it could be looked up using the 'County' &/or 'CBSA' variables.
  • "State" - name of the state
    • This variable is not in the original dataset, but it could be looked up using the 'County' &/or 'CBSA' variables.
  • "County FIPS Code" - identifier for the county
  • "County" - name of the county
  • "Site Latitude" - latitude coordinates of the site
  • "Site Longitude" - longitude coordinates of the site

💪 Competition challenge¶

Create a report that covers the following:

  1. Your EDA and data cleaning process.
  2. How does daily maximum 8-hour ozone concentration vary over time and regions?
  3. Are there any areas that consistently show high ozone concentrations? Do different methods report different ozone levels?
  4. Consider if urban activity (weekend vs. weekday) has any effect on ozone levels across different days.
  5. Bonus: plot a geospatial heatmap showing any high ozone concentrations.

🧑‍⚖️ Judging criteria¶

Recommendations (35%)
  • Clarity of recommendations - how clear and well presented the recommendation is.
  • Quality of recommendations - are appropriate analytical techniques used & are the conclusions valid?
  • Number of relevant insights found for the target audience.
Storytelling (35%)
  • How well the data and insights are connected to the recommendation.
  • How the narrative and whole report connects together.
  • Balancing making the report in-depth enough but also concise.
Visualizations (20%)
  • Appropriateness of visualization used.
  • Clarity of insight from visualization.
Votes (10%)
  • Up voting - most upvoted entries get the most points.

✅ Checklist before publishing into the competition¶

  • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
  • Remove redundant cells like the judging criteria, so the workbook is focused on your story.
  • Make sure the workbook reads well and explains how you found your insights.
  • Try to include an executive summary of your recommendations at the beginning.
  • Check that all the cells run without error

Objectives¶

Create a report that covers the following:

  1. Your EDA and data cleaning process.
  2. How does daily maximum 8-hour ozone concentration vary over time and regions?
  3. Are there any areas that consistently show high ozone concentrations? Do different methods report different ozone levels?
  4. Consider if urban activity (weekend vs. weekday) has any effect on ozone levels across different days.

Bonus: plot a geospatial heatmap showing any high ozone concentrations.

Summary of Findings & Recommendations¶

Detailed findings for each objective can be found at the end of the respective Analysis section, specifically in its Analysis - Conclusions subsection. There were four main objectives in this project, & thus four sections of analyses (section three was split into two). While those conclusion subsections focus on the findings & explanations of the analyses, this summary draws from them to inform readers based on the background context of this project. Recall that a primary goal of this project involved cleaning the data & using it to uncover insights regarding ozone levels & trends.

1) Exploratory data analysis & data cleaning process.¶

Of the 17 variables in the dataset, three required cleaning--'Date,' 'Local Site Name,' & 'County.' All three contained inconsistencies in how values were formatted, but the 'Date' variable was the "dirtiest" by far. Interpolation was used to preserve about 8,600 data points which originally had vague, unspecific dates. Even though this process was a little rudimentary, it was still reasonable & quite effective for the purposes of this project. Furthermore, some data points were duplicated; upon their removal, the size of the dataset decreased by another 211 observations.
As for outliers, the 'Daily Obs Count' variable had one strange value. There were 547 instances of "1,000," whereas the remaining data points had a numerical value between one & 24. Ultimately, there wasn't enough information to determine whether these values were appropriate or outliers, so they were left unaltered.
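A minimal sketch of how such out-of-range values can be surfaced. The column name comes from the dataset, but the numbers below are made up for illustration:

```python
import pandas as pd

# Hypothetical miniature of the 'Daily Obs Count' column: most values fall
# between 1 and 24, but a few rows report the suspicious value 1,000.
obs = pd.Series([17, 24, 1000, 8, 1000, 13], name='Daily Obs Count')

# Flag values outside the expected 1-24 range rather than deleting them,
# mirroring the "note them but leave them unaltered" decision above.
suspicious = obs[(obs < 1) | (obs > 24)]
print(len(suspicious))  # number of out-of-range observations
```

On the full dataset, the same mask would surface all 547 instances of "1,000" for inspection.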

The original dataset had 54,759 observations & 17 variables. The cleaned version of the dataset, defined as clean_df, has 54,027 total data points & 20 variables, though four of them pertain to the date.


2) How does daily maximum 8-hour ozone concentration vary over time and regions?¶

The Analysis II section examined the data across counties to determine how ozone levels varied over time in California throughout 2024. There are 58 counties in California, but only 48 had observations in the dataset. A variety of figures, including geospatial heatmaps, plus some arithmetic helped to uncover which counties & areas in California exhibited the greatest variance in ozone concentrations during 2024.

Across the entirety of California, it was found that ozone levels peaked in June & July at around 0.050 parts per million (ppm) & then bottomed out in December & January at just over 0.030 ppm.
This trend is replicated in most of the Californian counties during 2024. Additionally, counties that saw higher maximum ozone levels typically exhibited greater variance in terms of their ozone levels. At the beginning of the year, differences between average ozone levels were fairly similar across the 48 counties; however, these differences continued to widen from around April through July when most counties' ozone levels peaked in 2024. From then on, counties' average ozone levels generally declined as did the gaps between their respective average ozone levels.
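The seasonal curve described above boils down to a monthly average of the ozone column. A small sketch with made-up values (column names follow the dataset):

```python
import pandas as pd

# Tiny illustrative frame; the real analysis groups the full clean_df.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2024-01-15', '2024-06-15',
                            '2024-07-15', '2024-12-15']),
    'Daily Max 8-hour Ozone Concentration': [0.031, 0.050, 0.049, 0.032],
})

# Average ozone per calendar month, the basis for the seasonal curve.
monthly = (df
           .groupby(df['Date'].dt.month)['Daily Max 8-hour Ozone Concentration']
           .mean())
print(monthly.idxmax())  # month with the highest average (6 = June here)
```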

Through the handful of visuals presented in Analysis II, it was found that average monthly ozone levels meaningfully differed across three general areas in California.
To summarize, areas towards the southern & central parts of California exhibited more variance & greater average ozone levels compared to the rest of the state. On the other hand, counties directly along the western coast as well as parts of northern California exhibited the least variance & least severe ozone concentrations during 2024.

Recommendations: Note that the EPA deems an ozone concentration of 0.070 ppm or greater to be unhealthy & dangerous. While the most extreme average monthly ozone values in California did not reach this threshold, a nontrivial number of data points (about 2,500) recorded a concentration above it. The areas that exhibit the most severe ozone concentrations warrant the most resources in terms of receiving aid against the potential health hazards & dangers associated with high ozone levels. Moreover, organizations that provide such aid should expect greater demand in the late spring & early summer, when ozone levels, at least in California, tend to peak.
Organizations & agencies more involved in fighting climate change may want to follow the evolution of ozone levels consistently so that, if they continue to rise in the following years, they can be prepared to warn affected populations of what to expect & the health risks they may face from high ozone levels.


3A) Are there any areas that consistently show high ozone concentrations?¶

Two main metrics were used in Analysis III-A to evaluate this prompt because it inquired about two unique elements of ozone concentrations. Importantly, this section sought areas in California that exhibited both (not one or the other) consistency & large magnitudes of ozone levels. To determine which areas had higher ozone levels, averages were used, whereas for consistencies of ozone levels, the variance metric was used.
Counties were used to evaluate ozone levels on a geographic basis. Additionally, ozone levels of the counties were analyzed on a monthly basis to be able to evaluate consistency & how much they evolved.
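The two metrics described above, magnitude via the mean & consistency via the variance, can be computed together with a single grouped aggregation. A sketch with made-up county values:

```python
import pandas as pd

# Illustrative data only: mean captures magnitude, variance captures
# (in)consistency; counties high on BOTH are the ones flagged above.
df = pd.DataFrame({
    'County': ['Mariposa', 'Mariposa', 'Marin', 'Marin'],
    'Daily Max 8-hour Ozone Concentration': [0.060, 0.070, 0.030, 0.031],
})

stats = (df.groupby('County')['Daily Max 8-hour Ozone Concentration']
           .agg(['mean', 'var']))
print(stats)
```

In the actual analysis the grouping would also include the month, so that each county's monthly trajectory can be compared.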

The counties that best characterized these two metrics include the following ten counties: Amador, Butte, El Dorado, Inyo, Mariposa, Nevada, Shasta, Sutter, Tuolumne, & Ventura.

  • Across both metrics, the Mariposa, Inyo, El Dorado, Ventura, Tuolumne counties exhibited ozone levels in 2024 that were the most severe & the most consistent.

The EPA considers ozone levels of 0.07 ppm or higher to be unhealthy & potentially dangerous. The average monthly ozone levels of these ten most relevant counties peaked at about 0.06 ppm in July, so these areas were not in imminent danger in 2024; however, with increasing levels of climate change & warming temperatures, this could change soon.

Recommendations: The areas identified in this section are at the greatest risk of suffering health problems from high ozone levels. As such, the relevant populations will likely require the most assistance in mitigating such risks, particularly in the late spring & early summer when ozone levels are likely to peak.
On another note, these areas can also serve as a sort of natural experiment for better understanding the consequences of higher ozone levels, assuming they aren't well understood already.

3B) Do different methods report different ozone levels?¶

In the dataset, there are four defined collection methods which are named using numerical codes; however, there were about 6,300 data points whose method code was missing. Instead of ignoring or removing them from this section, they were compiled into a single group with a "Missing" method code. So, five groups were analyzed in total in Analysis III-B.
Two general subanalyses were performed which involved evaluating the distributions & averages of ozone levels across these five method groups across both 2024 as a whole & on a monthly basis.
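Bucketing the rows with a missing 'Method Code' into an explicit "Missing" group, as described above, is a one-line fill. A sketch with made-up values (column names follow the dataset):

```python
import pandas as pd
import numpy as np

# Hypothetical slice: some rows lack a 'Method Code'.
df = pd.DataFrame({
    'Method Code': [87.0, 53.0, np.nan, 199.0, np.nan],
    'Daily Max 8-hour Ozone Concentration': [0.045, 0.060, 0.036, 0.046, 0.035],
})

# Keep the incomplete rows by labelling them "Missing" instead of dropping them.
df['Method Group'] = df['Method Code'].fillna('Missing').astype(str)
means = df.groupby('Method Group')['Daily Max 8-hour Ozone Concentration'].mean()
print(means)
```

This preserves all five groups (the four numeric codes plus "Missing") for the comparisons that follow.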

In California in 2024, the "53" collection method recorded considerably higher ozone readings than any other (~ 0.060 ppm), whereas the "47" method recorded the lowest on average (~ 0.042 ppm). The ozone values associated with the method codes "199" & "87" were nearly identical (~ 0.045 ppm), but they were typically greater than the average ozone value across the entire dataset (~ 0.044 ppm). Ozone values for observations with an undefined 'Method Code' were noticeably lower than for any of the four collection method codes (~ 0.036 ppm).

The time of year also appeared to have an impact on how ozone levels compared across the five groups of method codes. Generally, the differences in average ozone levels were greater when ozone concentrations were more intense.

  • From January through March as well as November through December--when ozone levels were at their lowest in California in 2024--the average ozone readings for each of the five method codes were quite similar. Ozone levels across the five groups were the most similar in February. When excluding the "Missing" method code group, the average monthly ozone levels of the four distinct method codes were also fairly similar in April & May.
    • For example, in February, the average ozone levels across the five groups were between about 0.033-0.037 ppm.
  • From May through October, the disparities in average ozone levels across the five groups were most significant, particularly in July, when ozone levels generally peaked across the state in 2024.
    • In July, the average ozone levels across the five groups were between about 0.036-0.079 ppm.

Recommendations: Ultimately, this dataset doesn't provide much information regarding these collection methods or why there are multiple ones. Nonetheless, there could be a variety of reasons behind these differences in ozone observations including different locations, altitudes, other geographic factors (e.g. wind & cloud patterns or proximity to a sea/forest), human population density, technological factors (were all the collection methods built using the same technology?), etc.
Given that differences in ozone levels were more significant when ozone levels peaked, this could indicate a potential underlying bias in the methods. For example, perhaps certain collection methods resided in areas where ozone levels were naturally higher than others at certain points in the year. Alternatively, if the methods differed due to technological abilities, maybe some technologies falter more when ozone levels become more severe.


4) Consider if urban activity (weekend vs. weekday) has any effect on ozone levels across different days.¶

It was theorized that ozone levels may differ between weekdays & weekends because different kinds & quantities of urban activity usually take place during the week & on the weekend. For example, a large chunk of the general population has to commute to & from their workplace between Monday & Friday, whereas they are more likely to stay home on the weekend.
By evaluating ozone levels between weekdays & weekends, as well as on a daily basis, the evidence for this theory proved inconclusive.

First, ozone distributions were analyzed across each day of the week (Sunday through Saturday) & between weekdays (Monday through Friday) & weekends (Saturday & Sunday). This was done across the entire state of California in general & on a monthly basis.
In addition, the county of Los Angeles was analyzed in isolation to determine if the ozone patterns were consistent with what was found previously.
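The weekday/weekend split described above can be derived directly from the cleaned 'Date' column. A minimal sketch, assuming dates are already parseable (the sample dates here are made up):

```python
import pandas as pd

# Small made-up sample: two weekdays and one full weekend.
df = pd.DataFrame({'Date': pd.to_datetime([
    '2024-07-01', '2024-07-02', '2024-07-06', '2024-07-07'])})

# dayofweek: Monday=0 ... Sunday=6, so >= 5 marks Saturday/Sunday.
df['Day'] = df['Date'].dt.day_name()
df['Is Weekend'] = df['Date'].dt.dayofweek >= 5
print(df['Is Weekend'].sum())  # 2 weekend days in this sample
```

Grouping ozone values by 'Is Weekend' (or by 'Day') then gives the comparisons used in Analysis IV.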

While there were some dissimilarities in average ozone levels between weekdays & weekends on a monthly basis throughout 2024, the magnitudes of these disparities were not large enough to credibly state that urban activity had a considerable effect on ozone levels between the week & weekend. This was found to be the case across the whole state, on a monthly basis, & in the county of Los Angeles.

Recommendations: Given the small disparities in ozone levels between weekdays & weekends observed in this section, a more sophisticated analysis may be needed to further evaluate the consequences of urban activity in terms of ozone levels. More specifically, a dataset could be compiled for an area (e.g. a state county) that rigorously & consistently tracks urban activity. Unfortunately, such a process would likely be extremely tedious depending on the size & population of said area, not to mention it being potentially illegal depending on the area being "surveyed."
Another alternative could involve people voluntarily submitting their urban activity over a period of time such that consistent patterns & activities can be observed. Unfortunately, unless these volunteers are in an isolated area, following ozone concentrations strictly related to them seems like an unrealistic process. Additionally, a significantly large population size might be necessary in order to observe any significant ozone levels.
Ultimately, a clearer & more consistent understanding of urban activity may be necessary in order to determine its effect on ozone levels across different days. Of course, the analysis done in Analysis IV could be extended across different locations & compared, but this would likely make the process even more complicated.


Import packages¶

In [1]:
import pandas as pd
import numpy as np
import statistics
import itertools

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import geopandas as gpd

Import & inspect the data¶

Inspect the dataset to determine if there are missing values, inconsistencies, or anything that may require data cleaning.

In [2]:
#import the data
orig_data = pd.read_csv('ozone.csv')

#inspect the dataset
display(orig_data.info())
display(orig_data.sample(3, random_state=22))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54759 entries, 0 to 54758
Data columns (total 17 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Date                                  54759 non-null  object 
 1   Source                                54759 non-null  object 
 2   Site ID                               54759 non-null  int64  
 3   POC                                   54759 non-null  int64  
 4   Daily Max 8-hour Ozone Concentration  52021 non-null  float64
 5   Units                                 54759 non-null  object 
 6   Daily AQI Value                       52021 non-null  float64
 7   Local Site Name                       54759 non-null  object 
 8   Daily Obs Count                       54759 non-null  int64  
 9   Percent Complete                      54759 non-null  int64  
 10  Method Code                           48269 non-null  float64
 11  CBSA Code                             52351 non-null  float64
 12  CBSA Name                             52351 non-null  object 
 13  County FIPS Code                      54759 non-null  int64  
 14  County                                54759 non-null  object 
 15  Site Latitude                         54759 non-null  float64
 16  Site Longitude                        54759 non-null  float64
dtypes: float64(6), int64(5), object(6)
memory usage: 7.1+ MB
None
Date Source Site ID POC Daily Max 8-hour Ozone Concentration Units Daily AQI Value Local Site Name Daily Obs Count Percent Complete Method Code CBSA Code CBSA Name County FIPS Code County Site Latitude Site Longitude
36753 /2024 AQS 60731201 1 0.019 ppm 18.0 Pala Airpad 17 100 87.0 41740.0 San Diego-Carlsbad, CA 73 San Diego 33.362593 -117.09023
28162 October 28/2024 AQS 60659001 1 0.041 ppm 38.0 Lake Elsinore 17 100 87.0 40140.0 Riverside-San Bernardino-Ontario, CA 65 Riverside 33.676490 -117.33098
15762 2/19/2024 AQS 60371602 1 0.029 ppm 27.0 Pico Rivera #2 17 100 87.0 31080.0 Los Angeles-Long Beach-Anaheim, CA 37 Los Angeles 34.010290 -118.06850

In the original dataset, there are 54,759 observations (rows) & 17 variables (columns). Note that 19 variables are described in The Data section above--the 'State' & 'State FIPS Code' variables are not in the original dataset. This is likely because they would be redundant, as all of the data corresponds to California. For reference, California's FIPS code is 06.

At first glance at the dataset, it is evident that some variables require cleaning. For example, the 'Date' variable includes inconsistent values such as "/2024," "October 28/2024," & "2/19/2024." Determining which variables need cleaning requires going through each of them & inspecting their respective values. Moreover, some variables should be converted to different data types; as an example, the 'Date' variable could be converted from plain objects (strings) to datetime objects. Other data cleaning processes include handling format inconsistencies, typos, & inappropriate data types, plus looking for & handling outliers, & more.

Regarding missing values, there are five variables with missing values--'Daily Max 8-hour Ozone Concentration' (2,738 missing values), 'Daily AQI Value' (2,738), 'Method Code' (6,490), 'CBSA Code' (2,408), & 'CBSA Name' (2,408).
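These per-column missing counts come straight from `isna().sum()`. A sketch on a toy frame mimicking two of the affected columns (values made up):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking columns that carry missing values in ozone.csv.
df = pd.DataFrame({
    'Daily AQI Value': [38.0, np.nan, 27.0],
    'Method Code': [87.0, 87.0, np.nan],
})

# Per-column missing-value counts, restricted to columns that have any.
missing = df.isna().sum()
print(missing[missing > 0])
```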

The dataset also needs to be checked for duplicate observations. Primarily, observations that are completely identical should be looked at as such data points could inappropriately influence analyses of the data. If the duplicate observation(s) serve no purpose, then they should be removed.
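Fully identical rows can be counted & dropped with `duplicated` / `drop_duplicates`. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up frame with one fully duplicated row.
df = pd.DataFrame({'Site ID': [1, 1, 2], 'Daily AQI Value': [38, 38, 27]})

n_dupes = df.duplicated().sum()   # rows identical to an earlier row
deduped = df.drop_duplicates()    # keep the first occurrence of each
print(n_dupes, len(deduped))
```

Before dropping anything on the real data, the flagged rows should be inspected to confirm they serve no purpose.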


Analysis I¶

Exploratory Data Analysis (EDA) & data cleaning process.

Data cleaning can involve many processes such as checking for data inconsistencies, potential entry errors, missing values, duplicate data points, inappropriate data types, checking values of variables to look for potential outliers, etc.

To start, each of the 17 variables were explored to evaluate whether they need cleaning or not.

In [3]:
#Evaluate values of each variable
    #display(orig_data.columns)
display(orig_data['CBSA Name'].value_counts(dropna=False).sort_index())

#Check distribution for numerical variables
#display(orig_data['Site Longitude'].describe().reset_index().T)
CBSA Name
Bakersfield, CA                                  2878
Bishop, CA                                       1009
Chico, CA                                         719
Clearlake, CA                                     360
El Centro, CA                                    1204
Eureka-Arcata-Fortuna, CA                         344
Fresno, CA                                       2541
Hanford-Corcoran, CA                              363
Los Angeles-Long Beach-Anaheim, CA               5044
Madera, CA                                        725
Merced, CA                                        375
Modesto, CA                                       730
Oxnard-Thousand Oaks-Ventura, CA                 1833
Red Bluff, CA                                     583
Redding, CA                                       976
Riverside-San Bernardino-Ontario, CA             7188
Sacramento--Roseville--Arden-Arcade, CA          5048
Salinas, CA                                      1100
San Diego-Carlsbad, CA                           3227
San Francisco-Oakland-Hayward, CA                3508
San Jose-Sunnyvale-Santa Clara, CA               1835
San Luis Obispo-Paso Robles-Arroyo Grande, CA    2181
Santa Cruz-Watsonville, CA                        371
Santa Maria-Santa Barbara, CA                    3198
Santa Rosa, CA                                    276
Sonora, CA                                        365
Stockton-Lodi, CA                                 716
Truckee-Grass Valley, CA                          361
Ukiah, CA                                         446
Vallejo-Fairfield, CA                             956
Visalia-Porterville, CA                          1306
Yuba City, CA                                     585
NaN                                              2408
Name: count, dtype: int64

Variables that do not need cleaning:

  • 'Source', 'Site ID', 'POC', 'Daily Max 8-hour Ozone Concentration', 'Units', 'Daily Obs Count', 'Percent Complete', 'CBSA Name', 'County FIPS Code', 'Site Latitude', 'Site Longitude'

Variables that do need some cleaning:

  • 'Date' - Inconsistent formatting. Some values only list the year rather than year, month, & day. The data type could be converted to a datetime object. There are also several vague dates labelled "/2024" which are worth investigating.
  • 'Local Site Name' - There is some inconsistent formatting & capitalization (e.g. "Auburn-Atwood" vs "PLACERBILLE- CANAL STREET"). Only a handful of values have such problems. In the 'CBSA Name' variable, values are formatted as "[name]-[name]." As such, this format should be used when fixing inconsistencies in this variable.
  • 'County' - Some data points abbreviated Los Angeles as "LA" & San Francisco as "SF."

Furthermore, the following variables have data types which could be converted from floats to integers because none of their values have non-zero decimal parts. It should be pointed out that these variables defaulted to float types as opposed to integers because they each have missing values: with the default NumPy dtypes, a pandas column cannot hold integers alongside missing values (pandas' nullable 'Int64' dtype can, however). As such, if the data types of these variables were to be converted as described, the missing values would have to be dealt with beforehand.

  • 'Daily AQI Value', 'Method Code', 'CBSA Code'
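One possible workaround, sketched here on made-up values, is pandas' nullable integer dtype, which stores integers alongside `<NA>` without filling the gaps first:

```python
import pandas as pd
import numpy as np

# A float column that is "really" integers plus missing values.
s = pd.Series([38.0, np.nan, 27.0])

# The capital-I 'Int64' dtype is pandas' nullable integer type.
s_int = s.astype('Int64')
print(s_int.dtype)
```

Whether to use this or to fill the missing values first depends on how the columns are used downstream.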

Firstly, the 'County' variable was cleaned up.¶

In [4]:
#Clean the 'County' variable

#Make a new version of the dataset to avoid modifying the original
df_I = orig_data.copy()

#replace the "SF" values with "San Francisco"
df_I.replace({'County':'SF'}, 'San Francisco', inplace=True)
#replace the "LA" values with "Los Angeles"
df_I.replace({'County':'LA'}, 'Los Angeles', inplace=True)

#Check the changes -- GOOD
#display(orig_data.County.value_counts(dropna=False).sort_values(ascending=False))
#display(df_I.County.value_counts(dropna=False).sort_values(ascending=False))

Next, the 'Local Site Name' variable was cleaned up.¶

When hyphens are included in a value, the desired format is "[name]-[name]" with no spaces. There are 18 such values to clean, formatted as one of the following: "[name] - [name]", "[name]- [name]", or "[name] -[name]".

In [5]:
#Clean the 'Local Site Name' variable. There are 18 values to clean including:
    #"PLACERVILLE- CANAL STREET", "Lancaster - Fairgrounds", "Ojai - East Ojai Ave", 
    #"Joshua Tree NP - Cottonwood Visitor Center", "San Diego - Sherman Elementary School", "Fresno - Garland",
    #"Piru - Pacific", "San Diego - Kearny Villa Rd.", "Death Valley NP - Park Village", 
    #"Sequoia & Kings Canyon NPs - Ash Mountain", "Stockton - University Park", "Shasta Lake - Lake Blvd",
    #"Redding - Health Department", "San Jose - Jackson", "El Cajon - Lexington Elementary School",
    #"Trona - Athol/Telescope #2", "Berkeley- Aquatic Park", "Sequoia & Kings Canyon NPs - Lower Kaweah"

#To retrieve these values, use the following code
#display(orig_data[(orig_data['Local Site Name'].str.contains(' - ')) | (orig_data['Local Site Name'].str.contains('- ')) | 
#    (orig_data['Local Site Name'].str.contains(' -'))]['Local Site Name'].value_counts(dropna=False).reset_index())

#Fix instances of " - ". Need to use regex when using string patterns
df_I['Local Site Name'] = df_I['Local Site Name'].str.replace(r'\s*-\s*', '-', regex=True)
#check -- GOOD
    #display(df_I[df_I['Local Site Name'].str.contains(' -')]['Local Site Name'].value_counts(dropna=False).reset_index())

#Fix capitalization ONLY for the value: "PLACERVILLE- CANAL STREET"
for i in df_I.index:
    if df_I.loc[i,'Local Site Name'] == 'PLACERVILLE-CANAL STREET':
        df_I.loc[i,'Local Site Name'] = df_I.loc[i,'Local Site Name'].title()
#check -- GOOD
    #display(df_I[df_I['Local Site Name'].str.contains('PLACERVILLE|lacerville')]['Local Site Name'].unique())

#Check the resulting values -- GOOD
    #display(df_I['Local Site Name'].value_counts().reset_index().iloc[150:,:])

Lastly, the 'Date' variable needs cleaning.¶

There are two major formats & one abnormal value, formatted respectively as: "9/9/2024" (month/day/year), "April 04/2024", & "/2024". For both of the two major formats, all 12 months are accounted for. For the latter date, which only indicates the year, there are 9,202 data points associated with it (about 16.8% of the dataset). No other date corresponds to more than 122 data points.

For this variable, the desired format is ISO 8601, or: YYYY-MM-DD. The two major formats observable in the original dataset are easy enough to reformat. The observations whose 'Date' is "/2024" are trickier to handle, particularly given how many of them there are. To determine how to handle these values, the dataset was looked at more closely.

In looking at the first 60 data points in the original dataset as they are naturally sorted, the three different formats are interspersed seemingly at random, but the dates appear to be in some form of a chronological order; however, there are multiple inconsistencies which are listed below.

  • Some dates are flat out missing. For example, several dates are skipped between indices 23 & 24, which have the respective dates "1/24/2024" & "1/30/2024."
  • Some dates are also missing but they appear to coincide with observations with the date "/2024." For example, the dates of the first four observations in the dataset are: "/2024," "1/2/2024," "/2024," & "January 04/2024." Logical reasoning would presume that these missing dates should be January 1st & January 3rd, 2024 respectively.
  • The dates of the first 25 observations span "1/2/2024" through "1/30/2024," but the dates of the subsequent observations return to the beginning of January 2024. The next 30 observations go through January chronologically again. At the 57th observation, the date changes to February--"2/2/2024"--& proceeds once more in chronological order.

Over the next 60 observations, these patterns all persist, except that no other month is run through twice the way January was. Verifying whether these patterns continue throughout the rest of the dataset would be difficult to do by hand given how many observations there are (nearly 55,000). Moreover, the patterns start breaking down the further into the dataset one looks.

  • At index 143, the dataset skips May 2024--the values proceed as: "4/29/2024," "/2024," "/2024," & "6/4/2024." Additionally, these two middle dates aren't logically interpretable like they have been up until this observation.
  • There are several other data points with the "/2024" placeholder whose date is also not logically interpretable based on the dates of the adjacent data points. For example, at indices 288-290, the dates are: "10/27/2024," "/2024," & "11/1/2024." This placeholder could be 10/28, 10/29, 10/30, or 10/31 in 2024, & so it is not obviously interpretable.
  • At index 349, the dates go from the end of December 2024 back to the beginning of January 2024; "December 30/2024," "/2024," & "1/2/2024." Again, if this "/2024" date is a placeholder, its (hypothetical) true date is unclear, as it could be 12/31 or 1/1 in 2024.
  • At index 706, the dates again change back from late December 2024 to early January 2024; "12/30/2024," "/2024," "1/2/2024," ...
  • Soon after these indices (from 709 onwards), many more dates start to get skipped. For example, some dates (in order of the dataset) are: "1/3/2024," "1/4/2024," "/2024," "January 13/2024," "1/14/2024," "1/21/2024," "2/11/2024," "2/12/2024," "2/15/2024," "2/18/2024," ..., "2/25/2024," "4/8/2024," & so on.

In looking only at snapshots throughout the dataset, it appears that these inconsistencies & patterns generally continue up through the last 500 or so data points where the dates become totally random. For example, at index 54,260 the dates follow as: "December 04/2024," "3/9/2024," "3/2/2024," "1/5/2024," "/2024," "11/19/2024," "7/23/2024," etc. At this point, interpreting the "/2024" placeholder dates becomes impossible.

All of these points highlight the challenges associated with trying to interpret or interpolate the "/2024" placeholder dates. While many are potentially interpretable based on the adjacent data points with non-placeholder dates, a sophisticated method or function would have to be applied to do so appropriately & consistently throughout the entire dataset.

When it comes to the "/2024" dates, it is desirable to avoid dropping them from the dataset because doing so would remove a significant portion of the data. However, vague dates like "/2024" serve little purpose for data analysis because they aren't consistent with the other dates in the dataset.

In [6]:
#Investigate the 'Date' variable

#display(orig_data.iloc[:30,:5])
#display(orig_data.iloc[54730:,:5])
#orig_data.iloc[700:760,:5]

To start cleaning this variable, the two major formats were cleaned up to fit the aforementioned ISO 8601 standards. In order to do this, the data had to be split into observations with actual specific dates (i.e. "2024-09-10") & those with the vague dates ("/2024"). A key part of this process involves making sure that the order of observations in the dataset remains unchanged. This allows the reintegration of the data later on to be more seamless.

In [7]:
#Clean the 'Date' variable.

#Isolate data points with 'Date's not equal to "/2024", reformat them to "YYYY-MM-DD". Make sure to NOT adjust indices.
    #Rename original 'Date' column to preserve it in the new dataset.
df_good_dates = df_I[df_I['Date'] != '/2024'].rename(columns={'Date':'Orig Date'})
    #45,557 rows
df_good_dates['Date'] = pd.to_datetime(df_good_dates['Orig Date'], format='mixed').dt.strftime('%Y-%m-%d')
#check -- GOOD
    #display(sorted(df_good_dates['Date'].unique()))

#Return to original data, remove data points where 'Date' doesn't equal "/2024", & 
    #insert the data above with the reformatted dates. Make sure to NOT adjust indices whilst doing so. At the end, 
    #re-sort by the index & it should be in the same order as the original data.
    #Rename original 'Date' column to preserve it in the new dataset.
#Need to extract the vague dates where 'Date' equals "/2024".
df_weird_dates = df_I[df_I['Date'] == '/2024'].rename(columns={'Date':'Orig Date'})
    #9,202 rows
#Combine the reformatted specific dates (i.e. "2024-08-14") & the original vague dates ("/2024").
df_I_A = pd.concat([df_good_dates, df_weird_dates], axis=0).sort_index()
    #54,759 rows
#Convert the new date column to datetime types
df_I_A['Date'] = pd.to_datetime(df_I_A['Date'])

#Check that the datasets are ordered appropriately -- GOOD
display(df_I.equals(df_I_A.drop(columns=['Date']).rename(columns={'Orig Date':'Date'}).sort_index()))
True

Now that the specific dates have been altered to fit a single format, the vague dates ("/2024") can be approached. With the conclusions regarding these data points so far, these vague dates could be (& will be moving forward) deemed as "interpretable" or not. Whether a 'Date' of "/2024" is defined as "interpretable" depends on the 'Date's of observations adjacent to it. Recall that a majority of the original dataset appears to be sorted by 'Date' chronologically. If a "/2024" date is surrounded by specific dates that follow a logical, chronological order & there are an appropriate number of such vague dates, then it is possible that such dates are interpretable. As an example, consider the following cases.

  • Consider the indices 113-118 which have the respective dates: "2024-03-30," "2024-03-31," "/2024," "/2024," "2024-04-03," & "2024-04-04." These vague dates could be interpreted to be the dates between the most adjacent non-vague dates ("2024-03-31" & "2024-04-03"). Going off chronological order, there are two dates in between these most adjacent dates & two data points. As such, these two vague dates can reasonably be interpreted to be the dates "2024-04-01" & "2024-04-02" respectively.
  • Consider the indices 151-156 which have the respective dates: "2024-06-09," "/2024," "/2024," "/2024," "2024-06-14," & "2024-06-15." These vague dates could be interpreted to be the dates between the most adjacent non-vague dates ("2024-06-09" & "2024-06-14"); however, there are four dates in between these dates (based on a chronological order) & only three data points. Therefore, it is not obviously apparent as to what these vague, placeholder dates might be. Such vague dates can be deemed as not interpretable.

This process of what dates are interpretable or not will form the basis for cleaning these "/2024" 'Date's moving forward. So, for each observation whose 'Date' is "/2024," the following variables need to be obtained.

  • The most previous specific date & how many rows/indices/observations prior it is.
  • The immediately subsequent specific date & how many rows/indices/observations after it is.

These variables can then be used to determine if a vague date ("/2024") is "interpretable" or not. Then, the "interpretable" dates can be interpolated.

When undergoing this process, some of the vague dates are directly adjacent to observations that also have vague dates. This necessitates multiple lookups when determining the closest preceding & subsequent data points with specific dates. However, it is necessary to know how many such lookups are needed. In other words, given how the original dataset is sorted, what is the longest streak of consecutive observations whose 'Date's are vague ("/2024")? This streak dictates how far backwards or forwards in the dataset the loops need to look. A custom function was created to find this longest streak of consecutive "/2024" dates.

In [8]:
#Create a new dataframe focusing on the date variables
date_testing = df_I_A[['Orig Date','Date']].copy()
In [9]:
#Want to find the longest streak of consecutive "/2024" 'Date's

def longest_consecutive_dates(nums):
    '''
    A function was created to find the longest streak of consecutive observations which had a 'Date' of "/2024."
    The inserted data should be a list of relevant indices.
    This function will be used for the loop in the following cells which looks for 'Date's of most adjacent data points.
    '''
    #list containing the elements of the longest streak seen so far
    longest_streak = []
    #start the current streak with the first element
    current_streak = [nums[0]]
    #loop through the indices 1 by 1
    for i in range(1, len(nums)):
        #look for consecutive indices
        if nums[i] == nums[i - 1] + 1:
            #add the current element of the streak
            current_streak.append(nums[i])
        #once the consecutive streak ends
        else:
            #check if the current streak is longer than the longest streak observed so far
            if len(current_streak) > len(longest_streak):
                #if so, adjust the current longest streak
                longest_streak = current_streak
            #start a new streak from the current element
            current_streak = [nums[i]]

    #check if the current streak is longer than the longest streak observed
    if len(current_streak) > len(longest_streak):
        #if so, adjust the current longest streak
        longest_streak = current_streak
    #return the elements of the longest streak observed
    return longest_streak

#retrieve the result with the consecutive indices & the # consecutive elements
longest_consecutive_indices_nulldates = longest_consecutive_dates(orig_data[orig_data['Date'] == '/2024'].index.tolist())
display("Longest streak of consecutive indices: {}".format(len(longest_consecutive_indices_nulldates)))
display("Associated indices in original dataset: {}".format(longest_consecutive_indices_nulldates))

#retrieve the associated data
display(orig_data.iloc[12428:12436,:4])
'Longest streak of consecutive indices: 6'
'Associated indices in original dataset: [12429, 12430, 12431, 12432, 12433, 12434]'
Date Source Site ID POC
12428 August 29/2024 AQS 60292012 1
12429 /2024 AQS 60292012 1
12430 /2024 AQS 60292012 1
12431 /2024 AQS 60292012 1
12432 /2024 AQS 60292012 1
12433 /2024 AQS 60292012 1
12434 /2024 AQS 60292012 1
12435 September 06/2024 AQS 60292012 1

Given that the longest streak of data points with a 'Date' of "/2024" is six observations, a loop needs to look backwards/forwards to the sixth most previous or subsequent data point at most. In other words, the following loop needs to be iterated six times.
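For reference, the same longest-streak answer can also be obtained with a common pandas idiom: group a boolean mask by a cumulative sum that increments whenever the mask value changes, then take the largest group of True values. This is a sketch on a toy series (the values are illustrative, not from the dataset), not the notebook's original approach.

```python
import pandas as pd

# toy 'Date' column mimicking the dataset's placeholder pattern (assumed values)
s = pd.Series(["1/1/2024", "/2024", "/2024", "1/4/2024",
               "/2024", "/2024", "/2024", "1/8/2024"])

# boolean mask of placeholder rows
mask = s.eq("/2024")
# each run of identical mask values gets its own group id;
# summing the mask within each group gives the length of each True run
run_lengths = mask.groupby((mask != mask.shift()).cumsum()).sum()
longest = int(run_lengths.max())
print(longest)  # 3 for this toy series
```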

In [10]:
#FINDING MOST PREVIOUS 'DATE'S

#loop through each row in dataset
for i in date_testing.index:
    #for each "/2024" (null) 'Date', find the most previous non-null 'Date' & how many rows/indices prior it was
    if pd.isna(date_testing.loc[i,'Date']):

        #special case for the first observation (index 0), which has a 'Date' of "/2024"
        if i == 0:
            #there is no previous date because there is no previous observation
            date_testing.loc[i,'Most Previous Actual Date'] = pd.NaT
            date_testing.loc[i,'# Indices Prior'] = np.nan
            continue

        #the longest streak of consecutive "/2024" dates is six, so look back at most six rows
        for k in range(1, 7):
            #stop if the lookback would run past the start of the dataset
            if i - k < 0:
                break
            if not pd.isna(date_testing.loc[i-k,'Date']):
                #retrieve the closest preceding non-null 'Date' & how many rows/indices prior it was
                date_testing.loc[i,'Most Previous Actual Date'] = date_testing.loc[i-k,'Date']
                date_testing.loc[i,'# Indices Prior'] = k
                break
        else:
            #based on the streak function used in the previous cell, this should not happen
            date_testing.loc[i,'Most Previous Actual Date'] = "PROBLEM"
            date_testing.loc[i,'# Indices Prior'] = 'PROBLEM'

#Check -- GOOD
    #display(date_testing.iloc[12428:12470,:])
In [11]:
#Do the same but for the immediately SUBSEQUENT 'Date's

#loop through each row in dataset
for i in date_testing.index:
    #for each "/2024" (null) 'Date', find the closest subsequent non-null 'Date' & how many rows/indices after it was
    if pd.isna(date_testing.loc[i,'Date']):

        #special case for the last row (it actually has a real date, not "/2024", so this is just a safeguard)
        if i == date_testing.shape[0]-1:
            #there is no following date because there is no subsequent observation
            date_testing.loc[i,'Most Subsequent Actual Date'] = pd.NaT
            date_testing.loc[i,'# Indices After'] = np.nan
            continue

        #the longest streak of consecutive "/2024" dates is six, so look ahead at most six rows
        for k in range(1, 7):
            #stop if the lookahead would run past the end of the dataset
            if i + k > date_testing.shape[0] - 1:
                break
            if not pd.isna(date_testing.loc[i+k,'Date']):
                #retrieve the closest subsequent non-null 'Date' & how many rows/indices after it was
                date_testing.loc[i,'Most Subsequent Actual Date'] = date_testing.loc[i+k,'Date']
                date_testing.loc[i,'# Indices After'] = k
                break
        else:
            #based on the streak function used in the previous cell, this should not happen
            date_testing.loc[i,'Most Subsequent Actual Date'] = "PROBLEM"
            date_testing.loc[i,'# Indices After'] = 'PROBLEM'

#Check -- GOOD
    #display(date_testing.iloc[12428:12470,:])
In [12]:
display(date_testing.iloc[12428:12450,:])
#display(date_testing.iloc[345:355,:])
#date_testing.info()
Orig Date Date Most Previous Actual Date # Indices Prior Most Subsequent Actual Date # Indices After
12428 August 29/2024 2024-08-29 NaT NaN NaT NaN
12429 /2024 NaT 2024-08-29 1.0 2024-09-06 6.0
12430 /2024 NaT 2024-08-29 2.0 2024-09-06 5.0
12431 /2024 NaT 2024-08-29 3.0 2024-09-06 4.0
12432 /2024 NaT 2024-08-29 4.0 2024-09-06 3.0
12433 /2024 NaT 2024-08-29 5.0 2024-09-06 2.0
12434 /2024 NaT 2024-08-29 6.0 2024-09-06 1.0
12435 September 06/2024 2024-09-06 NaT NaN NaT NaN
12436 9/7/2024 2024-09-07 NaT NaN NaT NaN
12437 9/8/2024 2024-09-08 NaT NaN NaT NaN
12438 9/9/2024 2024-09-09 NaT NaN NaT NaN
12439 9/10/2024 2024-09-10 NaT NaN NaT NaN
12440 /2024 NaT 2024-09-10 1.0 2024-09-12 1.0
12441 September 12/2024 2024-09-12 NaT NaN NaT NaN
12442 9/13/2024 2024-09-13 NaT NaN NaT NaN
12443 9/14/2024 2024-09-14 NaT NaN NaT NaN
12444 /2024 NaT 2024-09-14 1.0 2024-09-16 1.0
12445 9/16/2024 2024-09-16 NaT NaN NaT NaN
12446 9/17/2024 2024-09-17 NaT NaN NaT NaN
12447 9/18/2024 2024-09-18 NaT NaN NaT NaN
12448 9/19/2024 2024-09-19 NaT NaN NaT NaN
12449 September 20/2024 2024-09-20 NaT NaN NaT NaN

Now that the necessary variables have been constructed, they can be used to determine whether the vague dates ("/2024") are potentially interpretable or not. The steps to create the code are listed below.

1) For each row, check whether 'Date' is null or not ('Date' = "/2024").

2) Determine if missing dates are interpretable or not using the 4 new variables.

- Find the difference (in days) between the 'Most Subsequent Actual Date' & 'Most Previous Actual Date', & then sum the '# Indices Prior' & '# Indices After'. If these two values are equal, then the missing 'Date' is interpretable.
    - E.g. In indices 12460:12470, the two consecutive missing dates ARE interpretable because there are two missing values & two dates in between the most adjacent non-missing dates of 2024-10-04 & 2024-10-07 (10-05, 10-06).
- If these numbers are NOT equal, then the missing 'Date' is not interpretable, & it should be left alone as pd.NaT (a null timestamp).
    - E.g. In indices 12428:12435, the six consecutive missing dates ARE NOT interpretable because there are six missing values but seven dates in between the most adjacent non-missing dates of 2024-08-29 & 2024-09-06 (08-30, 08-31, 09-01, 09-02, 09-03, 09-04, 09-05).
- Also, a special case needs to be examined when dates in the dataset flip from the end of the year (e.g. 12-31) back to the beginning of the year (e.g. 01-01). The arithmetic needs to be a little more sophisticated: pandas will naturally treat the date following 2024-12-31 as 2025-01-01, but in this dataset all dates are in 2024, so a naive subtraction treats the gap between 12-31 & 01-01 as nearly a full year (in the wrong direction) instead of a single day. This can be solved by adding a pandas.DateOffset of one year to the subsequent date so that the arithmetic reflects the true chronological gap. This is illustrated in the code below.

Additionally, the very first data point in the dataset has a 'Date' of "/2024," but based on the following values, it is reasonable to presume that it is 2024-01-01. This needs to be directly indicated in the code.
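To make the year-wrap arithmetic concrete, here is a small check with illustrative values (a December "previous" date followed by a January "subsequent" date, both forced into 2024):

```python
import pandas as pd

prev_date = pd.Timestamp("2024-12-30")  # most previous actual date
next_date = pd.Timestamp("2024-01-02")  # most subsequent actual date (wrapped back to January)

# naive difference is hugely negative because both dates sit in 2024
print((next_date - prev_date).days)                             # -363
# shifting the subsequent date forward one year restores the true 3-day gap
print(((next_date + pd.DateOffset(years=1)) - prev_date).days)  # 3
```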

In [13]:
#Create another variable indicating whether a missing 'Date' is interpretable or not using the logic above
date_testing['Missing Date Interpretable?'] = '-'

#retrieve indices with missing 'Date's (in other words: "/2024")
for i in [row for row in date_testing.index if pd.isna(date_testing.loc[row,'Date'])]:
    #need to consider the special case at index 0, which is easily self-interpretable given the following 'Date's (2024-01-01)
    if i == 0:
        date_testing.loc[i,'Missing Date Interpretable?'] = 'Yes'
        
    #for the rest of the indices
    else:
        #determine if the missing 'Date' is interpretable or not using appropriate mathematical logic
        if (date_testing.loc[i,'Most Subsequent Actual Date'] - date_testing.loc[i,'Most Previous Actual Date']).days == (
            date_testing.loc[i,'# Indices Prior'] + date_testing.loc[i,'# Indices After']):
            #if these values are equal, the missing 'Date's are interpretable
            date_testing.loc[i,'Missing Date Interpretable?'] = 'Yes'
        
        #consider the special case when the dates changed from the end of the year (e.g. 12-31) back to the beginning (01-01). 
            #adapt the previous logic to get the (chronological) difference in dates (in days) irrespective of year
        elif ((date_testing.loc[i,'Most Subsequent Actual Date'] + pd.DateOffset(years=1)) - date_testing.loc[i,
            'Most Previous Actual Date']).days == (
            date_testing.loc[i,'# Indices Prior'] + date_testing.loc[i,'# Indices After']):
            #if these values are equal, the missing 'Date's are interpretable
            date_testing.loc[i,'Missing Date Interpretable?'] = 'Yes'
        
        else:
            #otherwise, the missing 'Date's are not interpretable
            date_testing.loc[i,'Missing Date Interpretable?'] = 'No'

#check code in a different cell -- GOOD
In [14]:
#Check the code from the previous cell -- GOOD

display(date_testing.iloc[:,-1].value_counts(dropna=False))
    # '-'   45,557
    # 'Yes' 8,663
    # 'No'  539
        #indices: 144, 145, 152, 153, 154, ..., 54736, 54750, 54743, 54746, 54757

#date_testing[date_testing['Missing Date Interpretable?'] == 'No']
#date_testing.iloc[54735:54760,:]

#Retrieve indices where date changes from end of year to the beginning to check whether code is correct or not
year_change_indices = date_testing[date_testing['Most Previous Actual Date'] > date_testing[
    'Most Subsequent Actual Date']].index.tolist()

#date_testing.iloc[year_change_indices[108]-7:year_change_indices[108]+7, :]
Missing Date Interpretable?
-      45557
Yes     8663
No       539
Name: count, dtype: int64

Of the 54,759 observations in the dataset, 9,202 of them have an unclear 'Date' ("/2024"). Fortunately, most of these 'Date's, by the logic used in the cell above, have an interpretable date based on those adjacent to them (8,663 of them). The next step involves interpolating these interpretable dates. Those dates that are not interpretable will remain unchanged.

It should be pointed out that the methods & logic used here are not completely foolproof, but they are useful enough to redeem many observations with a vague 'Date' value. The code could be expanded to make the interpretation of unclear dates even more rigorous & consistent, but what has been done thus far is adequate for the purposes of this project: the number of data points that might have to be excluded from analyses dropped from 9,202 to only 539 (from about 16.8% of the dataset to less than one percent).

The next step involves utilizing this data to interpolate 'Date's of observations with missing 'Date's for those that are deemed interpretable. Then, the new dates can be inserted into the main dataset.

In [15]:
#Assign the missing dates that are deemed interpretable. Create another variable indicating the interpolated 'Date'.
date_testing['Interpolated Date'] = pd.NaT

#loop through the relevant data points
for i in [row for row in date_testing.index if date_testing.loc[row,'Missing Date Interpretable?'] == 'Yes']:
    
    #consider special case for the first data point which is reasonably interpretable based on the subsequent dates:
    if i == 0:
        date_testing.loc[i,'Interpolated Date'] = pd.to_datetime("2024-01-01")
    
    #for all other indices
    else:
        
        #consider special cases where date changes from end of year to beginning
        if date_testing.loc[i,'Most Previous Actual Date'] > date_testing.loc[i,'Most Subsequent Actual Date']:
            date_testing.loc[i,'Interpolated Date'] = date_testing.loc[i,'Most Previous Actual Date'] + pd.DateOffset(
                years=-1, days=date_testing.loc[i,'# Indices Prior'])
        
        #for other indices
        else:
            date_testing.loc[i,'Interpolated Date'] = date_testing.loc[i,'Most Previous Actual Date'] + pd.DateOffset(
                days=date_testing.loc[i,'# Indices Prior'])

#Check -- GOOD
#date_testing.iloc[:,-1].value_counts().sum() == 8,663
#date_testing.iloc[12428:12470,:]
In [16]:
#Rename the current formatted 'Date' column
df_I_A = df_I_A.rename(columns={'Date':'Orig Date (formatted)'})

#Insert the new dates into the main dataset
df_I_A['Interpolated Date'] = date_testing['Interpolated Date']

#Create a new variable that merges the original non-vague dates & the interpolated dates (for vague dates) together.
    #Use the pandas.combine_first() function
df_I_A['New Date'] = df_I_A['Orig Date (formatted)'].combine_first(df_I_A['Interpolated Date'])

#Check the result -- GOOD
#display(df_I_A[pd.isna(df_I_A['Orig Date (formatted)']) & pd.isna(df_I_A['Interpolated Date'])])
    #539 rows

#Reorganize the columns
df_I_A = df_I_A[['Orig Date','Orig Date (formatted)','Interpolated Date','New Date','Source','Site ID','POC',
                 'Daily Max 8-hour Ozone Concentration','Units','Daily AQI Value','Local Site Name','Daily Obs Count',
                 'Percent Complete','Method Code','CBSA Code','CBSA Name','County FIPS Code','County','Site Latitude',
                 'Site Longitude']]

#display(df_I_A.info())

To be clear, in further analyses that take time into account, the 'New Date' variable should be used. Note that in this variable there are still 539 data points with a vague or missing date ("/2024"). Since they account for less than one percent of the dataset, it is acceptable to drop them in later analyses.
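If a later analysis needs fully dated rows, those 539 unresolved rows can be dropped with a standard `dropna(subset=...)` filter. A sketch on a toy frame (the real call would presumably be against `df_I_A` with the 'New Date' column):

```python
import pandas as pd

# toy frame: one row has an unresolved date (pd.NaT)
toy = pd.DataFrame({
    "New Date": pd.to_datetime(["2024-01-01", None, "2024-01-03"]),
    "Daily AQI Value": [40, 55, 61],
})

# keep only rows whose date was resolved
dated = toy.dropna(subset=["New Date"])
print(len(dated))  # 2
```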

Now that the data has been cleaned in terms of inconsistencies & missing values, duplicates & outliers need to be handled. The former was dealt with first.

In [17]:
#Evaluate how many duplicated data points there are. It helps to sort the dataset in order to see duplicate data points easier.

#Check the original dataset
orig_data[orig_data.duplicated(keep=False)].sort_values(by=orig_data.columns.tolist())
    #5,811 total rows
orig_data[orig_data.duplicated(keep='first')].sort_values(by=orig_data.columns.tolist())
    #3,576 rows; there are 3,576 data points that have at least 1 duplicate


#Evaluate most recently cleaned data with interpolated dates
df_I_A[df_I_A.duplicated(keep=False)].sort_values(by=df_I_A.columns.tolist())
    #421 rows
df_I_A[df_I_A.duplicated(keep='first')].sort_values(by=df_I_A.columns.tolist())
    #211 rows; there are 211 data points that have at least 1 duplicate


#Remove duplicates from new data
df_I_B = df_I_A.drop_duplicates()
    #54,759 rows --> 54,548 (211 row difference)
df_I_B.info()
<class 'pandas.core.frame.DataFrame'>
Index: 54548 entries, 0 to 54758
Data columns (total 20 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   Orig Date                             54548 non-null  object        
 1   Orig Date (formatted)                 45364 non-null  datetime64[ns]
 2   Interpolated Date                     8663 non-null   datetime64[ns]
 3   New Date                              54027 non-null  datetime64[ns]
 4   Source                                54548 non-null  object        
 5   Site ID                               54548 non-null  int64         
 6   POC                                   54548 non-null  int64         
 7   Daily Max 8-hour Ozone Concentration  51810 non-null  float64       
 8   Units                                 54548 non-null  object        
 9   Daily AQI Value                       51810 non-null  float64       
 10  Local Site Name                       54548 non-null  object        
 11  Daily Obs Count                       54548 non-null  int64         
 12  Percent Complete                      54548 non-null  int64         
 13  Method Code                           48089 non-null  float64       
 14  CBSA Code                             52154 non-null  float64       
 15  CBSA Name                             52154 non-null  object        
 16  County FIPS Code                      54548 non-null  int64         
 17  County                                54548 non-null  object        
 18  Site Latitude                         54548 non-null  float64       
 19  Site Longitude                        54548 non-null  float64       
dtypes: datetime64[ns](3), float64(6), int64(5), object(6)
memory usage: 8.7+ MB

In the original dataset, there are 5,811 duplicated data points across all 17 variables, of which 3,576 are repeats that could be removed. Removing them would drop the size of the dataset from 54,759 observations to 51,183 (about a 6.5% reduction). However, with the newly updated data, particularly the cleaned dates, there are only 421 complete duplicates across all variables, of which 211 are removable repeats. So, the size of the dataset drops only from 54,759 data points to 54,548 (about 0.38%). Interpolating most of the vague dates thus helped preserve a great number of data points for future analyses.

With this updated dataset of 54,548 data points, there are six variables (counting only one 'Date' variable) that have missing values. The 'Method Code' variable has the most missing values of these six: 6,459 (about 12% of its values are missing).

Finally, the last part of the cleaning process involved checking the data for outliers, i.e., checking whether any variable contained strange or unusual values that didn't make contextual sense given the nature of the dataset. There are standard mathematical formulas that can flag outliers in numerical variables, but it is important to consider the nature of each variable when checking for outliers, and such formulas will not help find outliers in non-numerical variables. For instance, values of a variable that appear in only a very small number of observations could potentially be outliers. Sometimes, more knowledge of the dataset & the relevant variables is required to truly determine whether a potential outlier belongs or not.

  • As an example, consider the 'Percent Complete' variable whose integer values indicate whether all expected samples (of the air quality) were collected or not. Since this variable describes a percentage, the values should lie within zero & 100. If an observation had a value of 127 percent, this would be considered an outlier (or potentially even an incorrect value) because it doesn't make sense given the contextual nature of the variable. Given how many times such a value was present in the dataset would help dictate whether the corresponding data point(s) would be removed or not.
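A contextual range check like the 'Percent Complete' example can be sketched as follows (toy data; the 127 value is the hypothetical out-of-range case described above, not an actual value from the dataset):

```python
import pandas as pd

# toy data with one contextually impossible percentage (127)
toy = pd.DataFrame({"Percent Complete": [100, 96, 127, 88, 100]})

# percentages should lie within [0, 100]; flag anything outside that range
out_of_range = toy[~toy["Percent Complete"].between(0, 100)]
print(len(out_of_range))  # 1 row (the value 127)
```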

To evaluate the data for outliers, the values of each variable were tabulated along with how many times each appears in the dataset. Only one variable showed strange values: for 'Daily Obs Count,' which quantifies the number of observations reported in a given day, 54,001 of the 54,548 data points have a count between one & 24. The remaining 547 data points have a 'Daily Obs Count' of 1,000, which is undoubtedly markedly different from the other values. On the other hand, 547 data points (about 1.0% of the dataset) is still a considerable number of observations, so these data points were investigated further to see whether any other variables had strange values as well.

  • Unfortunately, further analysis across the other variables associated with these data points revealed no meaningful patterns that would suggest these data points are outliers.
  • At this point, more knowledge about the dataset & how these variables were measured would be helpful. For example, is it realistic for these monitoring stations to have taken 1,000 observations in a day when they usually recorded 24 or fewer? Alternatively, did these values arise from some error in the monitoring system or in the integration process that recorded & inputted them? Many questions could be asked about these 'Daily Obs Count' values, but more research would be required to get to the bottom of them.

Given that there is not enough information or context to conclude that 'Daily Obs Count' values of 1,000 are outliers, they & their associated data points were not altered in any way.

In [18]:
#Evaluate the strange 1,000 values in the 'Daily Obs Count' variable

display(df_I_B['Daily Obs Count'].value_counts(dropna=False).sort_index())

#investigate the strange 'Daily Obs Count' values (1000)
    #only 8 missing 'New Date's --> not meaningful
    #'Method Code' has 72 missing values; 344 "87", 81 "47" --> not meaningful
    #'Daily AQI Value' has 37 missing values. next most value ("31") pertains to 30 data points. --> maybe meaningful?
    #'CBSA Code' has 25 missing values; most popular value is "40140" (72 times) --> not meaningful

#df_I_B[df_I_B['Daily Obs Count'] > 25]['County'].value_counts(dropna=False)
Daily Obs Count
1           1
2           1
4           2
5           1
6           1
7           1
8           3
9           2
10          4
11          6
12         12
13        407
14        467
15        384
16        181
17      46143
18        181
19        201
20         24
21         50
22         46
23         34
24       5849
1000      547
Name: count, dtype: int64

After these cleaning steps, the resulting dataset had 521 observations with a missing 'New Date' (these originally had 'Date' values of "/2024"). Because these data points lack a clear, specific date, they were removed from the dataset. This was the last step performed in this Analysis I section.

In [19]:
#Drop observations with missing 'New Date's (521)
df_I_C = df_I_B.dropna(subset=['New Date'])

#Rename the cleaned version of the data that will be used going forward
clean_df = df_I_C.copy()

Analysis I - Conclusions¶

Exploratory Data Analysis & data cleaning.

Of the 17 variables in the dataset, three required cleaning--'Date,' 'Local Site Name,' & 'County.' All three had some formatting inconsistencies, but the 'Date' variable was by far the "dirtiest."

  • There were many missing values in the 'Date' column that could be interpolated, saving about 8,600 data points--the number of observations with missing/vague 'Date's dropped from 9,202 to 539. Though somewhat rudimentary, this process was reasonable & quite effective for the purposes of this project.
  • Furthermore, some data points were duplicated. Upon their removal, the size of the dataset decreased by another 211 observations.
  • When it came to outliers, the 'Daily Obs Count' variable had one strange value: 547 instances of 1,000, while the remaining data points had values between one & 24. Ultimately, there wasn't enough information to determine whether these values were appropriate or were outliers, so they were left unaltered.

Recall that the original dataset had 54,759 observations & 17 variables. The cleaned version of the dataset, defined as clean_df, has 54,027 total data points & 20 variables, though four of them pertain to the date. Going forward, the 'New Date' date variable is the one that should be used when including time in analyses. In the clean_df dataset, this variable has zero missing values.


Analysis II¶

How does daily maximum 8-hour ozone concentration vary over time and regions?

The state of California is the third largest US state by area--about 164,000 square miles (424,000 square kilometers). In terms of latitude, California spans between about 32.5 & 42 degrees North of the equator, while its longitudinal range spans between about 114.5 & 125 degrees West of the prime meridian. Based on 2024 estimates, California is the most populated US state--about 39.5 million people. Moreover, Los Angeles, on the southwestern coast of California, is the second most populated city in the US--just under four million residents. The greater the ozone concentration levels in the atmosphere, the worse the potential consequences for humans & other affected species. For example, relatively high ozone levels can produce increased amounts of smog, as observed in the cover image at the beginning of this project.

Analyzing how ozone levels evolved in California throughout 2024 could indicate which regions are more exposed to harmful ozone pollution. In this dataset, three pairs of variables pertain to the geography of California--the 'CBSA' variables, the 'County' variables, & the latitude & longitude variables. The former two pairs are quite similar in that they define & outline areas in the state based on metrics like population, administrative boundaries, & other economic factors. According to Wikipedia, there are 35 core-based statistical areas (CBSA's) in California as of 2023, but only 32 unique areas appear in the dataset.

In contrast to core-based statistical areas, there are 58 counties in California according to the California State Association of Counties. In the dataset, there are 48 unique counties. As such, counties are slightly more specific when it comes to outlining areas in California than core-based statistical areas. Given that this section is interested in evaluating geographical areas based on environmental details, as opposed to financial or administrative factors, counties were chosen as the "regions" rather than CBSA's.

When it comes to the time aspect of this section, different levels could be explored depending on the depth desired. In this case, the data was analyzed on a monthly basis. Since the data is evaluated over time, data points will be aggregated into a single value per month. Therefore, it is important to assess which metrics are most appropriate for summarizing ozone levels over time. This can be done by exploring the distribution of values in the 'Daily Max 8-hour Ozone Concentration' variable. If the data is significantly skewed (i.e. the distribution is considerably asymmetrical or not "normal"), outliers exert a significant influence on the mean; in such cases, the median is more appropriate. If the distribution is not significantly skewed, then the mean is fine to use. Identifying which metric to use is the first step in this section.
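The mean-vs-median decision described above can be sketched as follows. Synthetic right-skewed values stand in for the real column, & the 0.01 ppm closeness threshold is an illustrative assumption, not a rule from the analysis:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed ozone-like values (ppm), standing in for the
# 'Daily Max 8-hour Ozone Concentration' column
ozone = pd.Series(rng.gamma(shape=4, scale=0.011, size=1000))

mean, median = ozone.mean(), ozone.median()
print(f"mean={mean:.4f} ppm, median={median:.4f} ppm, skew={ozone.skew():.2f}")

# Rule of thumb (assumed threshold): if mean & median are close, the mean is a
# fair monthly aggregate; otherwise prefer the median
metric = "mean" if abs(mean - median) < 0.01 else "median"
```

A positive `skew()` confirms the right tail; how close "close" must be depends on the scale of the variable & the analysis at hand.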

When evaluating how ozone concentrations evolved over time & regions, a few different figures can effectively visualize such changes over a geographical space, including density heatmaps & choropleths. Conversely, a generic line plot of average monthly ozone levels per county in 2024 would not be appropriate because of how many unique counties there are; such a plot would be far too congested, so more sophisticated plots will be more effective.

  • More specifically, choropleths, geographical scatter plots, & geospatial or density heatmaps were used to visualize ozone levels by county over time. Nonetheless, to start this process, a more general analysis of how ozone levels behaved over the entirety of 2024 throughout California can be evaluated & visualized. A simple line plot would suffice.

To expand upon the analysis, the differences in minimum & maximum ozone levels for different areas can be obtained. This would indicate which California regions experienced the most & the least variance in ozone levels in 2024.

It should be pointed out that some variables relevant to this section still have missing values. More specifically, there are about 2,700 & 2,350 missing values in the 'Daily Max 8-hour Ozone Concentration' variable & the pair of 'CBSA' variables, respectively. Fortunately, these missing values correspond to only about five percent of the total observations, so it is reasonable to exclude them from calculations if necessary.
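One way to confirm that excluding such rows is safe is to measure the share of observations lost by `dropna()`. A minimal sketch on a hypothetical stand-in frame (column names mirror the dataset; values are made up):

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in with missing ozone & CBSA values
df = pd.DataFrame({
    "Daily Max 8-hour Ozone Concentration": [0.04, np.nan, 0.05, 0.03],
    "CBSA Code": [40140.0, 41860.0, np.nan, 31080.0],
})

# Drop rows missing the ozone measurement & report the cost of doing so
kept = df.dropna(subset=["Daily Max 8-hour Ozone Concentration"])
lost_pct = 100 * (len(df) - len(kept)) / len(df)
print(f"rows kept: {len(kept)}, dropped: {lost_pct:.0f}%")
```

If the dropped share is small (about five percent in this dataset), excluding the rows is unlikely to bias monthly or county-level averages much.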

In [20]:
#Evaluate distribution of 'Daily Max 8-hour Ozone Concentration' values

#obtain summary metrics to compare mean & median
display("Average 8-hour ozone level in California in 2024: {} ppm.".format(np.round(
    clean_df['Daily Max 8-hour Ozone Concentration'].mean(), 4)))
display("Median 8-hour ozone level in California in 2024: {} ppm.".format(
    clean_df['Daily Max 8-hour Ozone Concentration'].median()))

#plot the data
ax1 = sns.histplot(data=clean_df, x='Daily Max 8-hour Ozone Concentration', kde=True)
#append the mean value
ax1.axvline(x=clean_df['Daily Max 8-hour Ozone Concentration'].mean(), color='red', linestyle='--', alpha=0.5)

#edit plot details
plt.title('Distribution of Ozone Concentration Levels - 2024')
plt.xlabel('Daily Max 8-hour Ozone Concentration (ppm)')
plt.tight_layout();
'Average 8-hour ozone level in California in 2024: 0.0435 ppm.'
'Median 8-hour ozone level in California in 2024: 0.041 ppm.'

The figure above shows the distribution of values of the 'Daily Max 8-hour Ozone Concentration' variable--in other words, the distribution of ozone levels in California throughout 2024. Recall that this variable quantifies the highest eight-hour ozone value for a given day. The horizontal axis is in units of parts per million (ppm), which describes how much ozone is in the air. The vertical red dashed line indicates the mean value of the variable, about 0.044 ppm.

The EPA's general eight-hour standard for what level of ozone is deemed safe is 0.070 ppm; higher levels risk greater exposure to health issues & air pollution. From this figure, it is evident that most data points in California recorded an ozone concentration below this value. More specifically, about 2,600 data points in the dataset recorded an ozone concentration above 0.070 ppm, corresponding to about five percent of the total observations.
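An exceedance count like the one above reduces to a boolean comparison. A minimal sketch on hypothetical daily values (the 0.070 ppm threshold is the EPA standard cited in the text):

```python
import pandas as pd

# Hypothetical daily max 8-hour ozone values (ppm)
ozone = pd.Series([0.041, 0.072, 0.055, 0.080, 0.038, 0.069])

STANDARD = 0.070  # EPA 8-hour ozone standard (ppm)
n_exceed = int((ozone > STANDARD).sum())
pct = 100 * n_exceed / ozone.notna().sum()
print(f"{n_exceed} readings ({pct:.1f}%) exceed {STANDARD} ppm")
```

On the real data, the same comparison against `clean_df['Daily Max 8-hour Ozone Concentration']` yields the ~2,600 / ~5% figures quoted above.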

Although the distribution of the 'Daily Max 8-hour Ozone Concentration' variable is slightly skewed to the right, the mean & median values are still quite comparable as they differ by about 0.0025 ppm. As such, it was deemed appropriate to use the average for the remainder of this analysis.

To begin, an elementary understanding of how ozone levels developed in California in 2024 provides a baseline to compare against later in the analysis. To do this, a simple line plot can be made visualizing the average monthly ozone levels across all Californian counties.

In [21]:
#Focus on relevant variables
data_II = clean_df.dropna(subset=['Daily Max 8-hour Ozone Concentration'])[['New Date','Daily Max 8-hour Ozone Concentration',
                    'CBSA Code','CBSA Name','County FIPS Code','County','Site Latitude','Site Longitude']]
display(data_II.info())
<class 'pandas.core.frame.DataFrame'>
Index: 51309 entries, 0 to 54755
Data columns (total 8 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   New Date                              51309 non-null  datetime64[ns]
 1   Daily Max 8-hour Ozone Concentration  51309 non-null  float64       
 2   CBSA Code                             49057 non-null  float64       
 3   CBSA Name                             49057 non-null  object        
 4   County FIPS Code                      51309 non-null  int64         
 5   County                                51309 non-null  object        
 6   Site Latitude                         51309 non-null  float64       
 7   Site Longitude                        51309 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(1), object(2)
memory usage: 3.5+ MB
None
In [22]:
#Obtain avg ozone levels per month
mth_oz_avgs = data_II.groupby(data_II['New Date'].dt.month)['Daily Max 8-hour Ozone Concentration'].mean().reset_index().rename(
                columns={'New Date':'Month'})

#Plot average monthly ozone levels across all CA counties in 2024
plt.figure(figsize=(7,4))
ax1 = sns.lineplot(data=mth_oz_avgs, x='Month', y='Daily Max 8-hour Ozone Concentration')

#edit plot details
plt.title('Monthly Average Ozone Levels - California 2024')
plt.xticks(fontsize=9, ticks=range(1,13,1))
plt.ylabel('Average Daily Max 8-hour Ozone Concentration (ppm)', fontsize=9)
plt.yticks(fontsize=8, ticks=np.arange(0,0.061,0.01))
plt.grid(axis='both',alpha=0.2,color='gray')
plt.tight_layout();

The figure above illustrates the average monthly ozone concentration levels in California throughout 2024. The vertical axis is in units of parts per million (ppm). Recall that the EPA's general eight-hour standard, in terms of what level of ozone is deemed safe, is 0.070 ppm. Higher levels risk greater exposure to health issues & amounts of air pollution.

On a monthly basis, ozone levels in California during 2024 peaked in July & were at their lowest in December. Fortunately, no month ever exhibited an average ozone level at or higher than the EPA's eight-hour standard of 0.070 ppm.
If a figure of average monthly temperatures was plotted alongside this figure, they would likely look very similar. This would indicate that the two variables--temperature & ozone concentration levels--are positively correlated, which is logical given that ozone levels rise when there is more sunlight & energy in the atmosphere as is the case during the peak summer months like June & July.

Ultimately, this figure provides a basis for how to evaluate ozone concentrations on a county level in the rest of this section. Generally, ozone levels will be expected to peak in late spring & early summer (i.e. June & July) & bottom out in early winter (i.e. December & January). By comparing counties' monthly ozone averages to this figure, it could indicate which areas exhibited higher & lower levels of ozone during 2024.


To superpose a map of California counties onto a plottable figure (a choropleth in this case), a GeoJSON--a format for encoding geographical data structures--is required. A file named california-counties.geojson, obtained from GitHub, contains the necessary components for constructing the desired figures. This GeoJSON data was imported with the help of the geopandas module & defined as the variable gdf.
Moreover, this GeoJSON data contains all 58 counties in California, so it easily reveals which counties are missing from the original dataset.

Furthermore, a plotly.express choropleth requires the specific GeoJSON format structure to visualize the California counties. The geographic scatter plots can operate off latitudes & longitudes provided they are supplied a "scope" telling the mapping function what geographic region the figure pertains to. The density/geospatial heatmaps can also operate off latitudes & longitudes.

With this in mind, two different formats of data aggregation need to be organized. One needs only to include California counties, whereas the other needs to also incorporate the 'Site Latitude' & 'Site Longitude' variables. In the original dataset, different counties span multiple latitudes & longitudes, so there might be multiple data points associated with unique counties in the latter aggregation results.

To start, these analyses were done without incorporating time. In other words, average ozone levels were obtained per county.

In [23]:
#import CA-county template data to help with plotting
gdf = gpd.read_file("california-counties.geojson").rename(columns={'name':'County'})
gdf['County'] = gdf['County'].str.title()

#extract all CA counties bc they aren't all in the original dataset
all_CA_counties = gdf.County.tolist()
In [24]:
#Calculate avg Ozone concentration per County
cty_oz_avgs_I = data_II.groupby('County')['Daily Max 8-hour Ozone Concentration'].mean().reset_index()
#Calculate avg Ozone concentration per County, Latitude, Longitude
cty_latlon_oz_avgs_I = data_II.groupby(['County','Site Latitude',
                    'Site Longitude'])['Daily Max 8-hour Ozone Concentration'].mean().reset_index()

#COUNTY-ONLY DATA
#Merge CA-county template data & avgs to retrieve all counties
cty_oz_avgs_II = gdf.set_index('County').join(cty_oz_avgs_I.set_index('County'), how='left', 
                            on='County').sort_values('County').reset_index()
#Add a variable that imputes missing Ozone concentrations with 0
cty_oz_avgs_II['Imputed Ozone Conc.'] = cty_oz_avgs_II['Daily Max 8-hour Ozone Concentration'].fillna(0)
#Create a variable for hover_info in the plotly.express figures to indicate whether a county had missing Ozone data or not
cty_oz_avgs_II['Ozone Info'] = cty_oz_avgs_II.apply(lambda row: 
                            f"{row['County']}: {row['Imputed Ozone Conc.']:.3f} ppm" +
                            (" (No data — filled with 0)" if row['Imputed Ozone Conc.'] == 0 else ""), axis=1)

#COUNTY + LATITUDE/LONGITUDE DATA
#Merge CA-county template data & avgs to retrieve all counties
cty_latlon_oz_avgs_II = gdf.set_index('County').join(cty_latlon_oz_avgs_I.set_index('County'), how='left', 
                                                      on='County').sort_values('County').reset_index()
#Add a variable that imputes missing Ozone concentrations with 0
cty_latlon_oz_avgs_II['Imputed Ozone Conc.'] = cty_latlon_oz_avgs_II[
                        'Daily Max 8-hour Ozone Concentration'].fillna(0)
#Create a variable for hover_info in the plotly.express figures to indicate whether a county had missing Ozone data or not
cty_latlon_oz_avgs_II['Ozone Info'] = cty_latlon_oz_avgs_II.apply(lambda row: 
                            f"{row['County']}: {row['Imputed Ozone Conc.']:.3f} ppm" +
                            (" (No data — filled with 0)" if row['Imputed Ozone Conc.'] == 0 else ""), axis=1)

display(cty_oz_avgs_II.head(2))
County cartodb_id created_at updated_at geometry Daily Max 8-hour Ozone Concentration Imputed Ozone Conc. Ozone Info
0 Alameda 1 2015-07-04 21:04:58+00:00 2015-07-04 21:04:58+00:00 MULTIPOLYGON (((-122.31293 37.89733, -122.2884... 0.031304 0.031304 Alameda: 0.031 ppm
1 Alpine 2 2015-07-04 21:04:58+00:00 2015-07-04 21:04:58+00:00 POLYGON ((-120.07239 38.70277, -119.96495 38.7... NaN 0.000000 Alpine: 0.000 ppm (No data — filled with 0)
In [25]:
#Create a choropleth on CA-counties showing avg Ozone levels in 2024
fig = px.choropleth(
    cty_oz_avgs_II,
    geojson=cty_oz_avgs_II.geometry.__geo_interface__,
    locations=cty_oz_avgs_II.index,
    hover_name='Ozone Info',
    color='Imputed Ozone Conc.',
    range_color=[0,0.065],
    #max avg Ozone level was ~0.055
    color_continuous_scale="OrRd",
    title="Average Ozone Concentration by County (California, 2024)")

#edit plot details
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(title_x=0.5, width=900, height=500, margin=dict(l=20,r=10,t=30,b=55))
fig.update_coloraxes(colorbar_title="Average 8-hour Ozone Concentration (ppm)")
fig.show()

The choropleth above displays the average eight-hour ozone concentration levels across the 58 Californian counties in 2024. The ozone levels, which are in units of parts per million (ppm), are illustrated through the different colors on the geographical map. Note that the original dataset did not have data for ten of the counties. These are denoted with the color white because they were imputed with an average ozone value of zero. Moreover, upon hovering over such counties, it is denoted directly that they had no ozone data.

Recall that the EPA daily eight-hour standard of what ozone concentration is deemed "safe" is at 0.07 ppm. Fortunately, no county saw an average value at this level (or higher) in 2024, but this does not mean that ozone levels never exceeded this standard throughout the year. Generally, average ozone levels in California were between 0.03 & 0.055 ppm in 2024.

Throughout 2024, San Bernardino County had the highest average ozone level in California (~0.055 ppm); it is also the largest county by area. Generally, the southernmost & central Californian counties experienced higher ozone levels on average, while counties along the western coast, particularly the northern two-thirds of the Californian coastline, experienced lower ozone levels on average. The northern counties might also experience lower ozone levels on average, but too many northern Californian counties are missing data to confirm this either way.

In [26]:
#Create a geographic scatterplot of avg Ozone levels per 'County' in 2024 using 'Latitude's/'Longitude's
fig = px.scatter_geo(cty_latlon_oz_avgs_II,
    lat='Site Latitude',
    lon='Site Longitude',
    color='Daily Max 8-hour Ozone Concentration',
    hover_name='County',
    scope='usa',
    title='Average Ozone Concentration Levels Across California (2024)',
    color_continuous_scale='OrRd',
    #max avg Ozone level ~0.061
    range_color=[0,0.06],
    template='plotly_white')

#edit plot details
fig.update_layout(geo=dict(center=dict(lat=37, lon=-120)), title_x=0.5, width=900, height=500)
fig.update_coloraxes(colorbar_title="Average 8-hour Ozone Concentration (ppm)")
fig.show()

The figure above is a geographic scatter plot which displays average ozone levels per Californian county in 2024. Additionally, the geographical data is broken down by longitude & latitude so the dots on the map might shift around for various counties. This plot mostly displays the same information as the previous choropleth with the main difference being how location is displayed.

As observed in the choropleth previously, higher average ozone levels appear in southern & central California, while counties along the western coastline exhibit lower average ozone levels, particularly in the northern two-thirds to three-quarters of the state. In this case, the location with the highest average ozone level was in Tulare County (~0.061 ppm), in the central & slightly southern part of California.

Next, time was incorporated. More specifically, average ozone levels were calculated per month & location.

In [27]:
#NOW INCORPORATE TIME

#Add month variable to data
data_II['Month'] = data_II['New Date'].dt.month

#Create a dataframe containing all combinations of all CA counties & months (58*12 = 696 pairs)
mth_CA_cty_pairs = pd.DataFrame(list(itertools.product(list(range(1,13,1)), all_CA_counties))).rename(
                        columns={0:'Month',1:'County'})

#Get avg Ozone concentration per month-county pair
mth_cty_oz_avgs_I = data_II.groupby(['Month','County'])['Daily Max 8-hour Ozone Concentration'].mean().reset_index()
#Get avg Ozone concentration per month-county-latitude-longitude group
mth_cty_latlon_oz_avgs_I = data_II.groupby(['Month','County','Site Latitude','Site Longitude'])[
    'Daily Max 8-hour Ozone Concentration'].mean().reset_index()


#COUNTY-ONLY DATA
#Merge all CA-county-month pairs & avgs to include pairs with no data
mth_cty_oz_avgs_II = mth_cty_oz_avgs_I.merge(mth_CA_cty_pairs, on=['Month','County'], how='outer').sort_values(['Month',
                                                                    'County']).reset_index(drop=True)
#Attach GeoJSON template data
mth_cty_oz_avgs_II = mth_cty_oz_avgs_II.set_index('County').join(gdf.set_index('County'), on='County', how='left').sort_values(
                    ['Month','County']).reset_index()
#Add a variable that imputes missing Ozone concentrations with 0
mth_cty_oz_avgs_II['Imputed Ozone Conc.'] = mth_cty_oz_avgs_II['Daily Max 8-hour Ozone Concentration'].fillna(0)
#Create a variable for hover_info in the plotly.express figures to indicate whether a county had missing Ozone data or not
mth_cty_oz_avgs_II['Ozone Info'] = mth_cty_oz_avgs_II.apply(lambda row: 
                            f"{row['County']}: {row['Imputed Ozone Conc.']:.3f} ppm" +
                            (" (No data — filled with 0)" if row['Imputed Ozone Conc.'] == 0 else ""), axis=1)

#Create another variable indicating the 'Month' name instead of number
mth_cty_oz_avgs_II['Month Name'] = pd.to_datetime(mth_cty_oz_avgs_II['Month'], format='%m').dt.month_name()


#COUNTY + LATITUDE/LONGITUDE DATA
#Don't bother merging all CA-county pairs & Ozone avgs bc it doesn't incorporate latitude & longitude per CA-county pair. Is 
    #more work than necessary given the ultimate plot this data produces.
#Attach GeoJSON template data to the avgs
mth_cty_latlon_oz_avgs_II = gdf.set_index('County').join(mth_cty_latlon_oz_avgs_I.set_index('County'), how='left', 
                                                      on='County').sort_values(['Month','County']).reset_index()
#Drop month-location data points w/missing Ozone data
mth_cty_latlon_oz_avgs_II = mth_cty_latlon_oz_avgs_II.dropna(subset=['Daily Max 8-hour Ozone Concentration'])
    #size of dataset goes from 1,879 --> 1,869 rows

#Create another variable indicating the 'Month' name instead of number
mth_cty_latlon_oz_avgs_II['Month Name'] = pd.to_datetime(mth_cty_latlon_oz_avgs_II['Month'], format='%m').dt.month_name()
In [28]:
#Create a choropleth on CA-counties showing avg Ozone levels per month in 2024
fig = px.choropleth(
    mth_cty_oz_avgs_II,
    geojson=gdf.__geo_interface__,
    locations='County',
    featureidkey='properties.County',
    animation_frame='Month Name',
    hover_name='Ozone Info',
    color='Imputed Ozone Conc.',
    #max avg Ozone level ~ 0.082
    range_color=[0,0.085],
    color_continuous_scale="OrRd",
    title="Monthly Average Ozone Concentration by County (California, 2024)")

#edit plot details
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(title_x=0.5, width=900, height=500, updatemenus=[{'buttons':[{'args':[None, {'frame':{'duration':1400}}]}]}])
fig.update_coloraxes(colorbar_title="Average 8-hour Ozone Concentration (ppm)")
fig.show()

The choropleth above displays monthly average eight-hour ozone concentration levels across the 58 Californian counties in 2024. The ozone levels, which are in units of parts per million (ppm), are illustrated through the different colors on the geographical map. Note that the original dataset did not have data for ten of the counties. These are denoted with the color white because they were imputed with an average ozone value of zero. Moreover, upon hovering over such counties, it is denoted directly that they had no ozone data.

Recall that the EPA daily eight-hour standard of what ozone concentration is deemed "safe" is at 0.07 ppm. On average, this value is only reached or surpassed five times & by three unique counties in 2024. San Bernardino had average ozone levels between 0.071-0.082 ppm in June, July, & August. Kern & Tulare also had an average ozone level above 0.07 ppm in July. In other words, ozone levels in California were the most harmful in these three aforementioned counties between June & August. Interestingly, these three counties--in the southern & central part of the state--are indirectly adjacent, providing further credence that the southern & central region of California experienced higher levels of ozone than the rest of the state.

In this dataset, average ozone levels peaked in July (at just over 0.04 ppm) & were lowest in December & January (just below 0.025 ppm). This trend correlates strongly with temperature, which is logical given that ground-level ozone forms more readily when temperatures are higher & there is more sunlight. The trend is evident in the choropleth above as it cycles through the 12 months of the year.

In the 12-month animation, the colors of the counties generally darken from January onwards & are most striking in July, indicating the peak of average ozone levels throughout California. After July, the colors lighten again until they are at their palest in December, when average ozone levels are at their lowest. Generally, all counties in the choropleth follow this pattern, but far from uniformly. Differences in average ozone levels between regions are most striking in the late spring & early summer (i.e. June & July).

  • The southern & central counties, like San Bernardino & others mentioned previously, exhibit continuously increasing ozone levels throughout the first seven months of the year. From August onwards, these levels decline at a sharper rate than how they increased.
  • The northern counties (e.g. Siskiyou & Shasta) exhibit a similar pattern but their ozone levels don't reach the same severity as the southern & central counties do in the late spring & early summer.
  • The counties along the western coast, excluding the southernmost coastline counties (e.g. Santa Barbara, Monterey, & Humboldt), exhibit a more turbulent pattern in terms of average ozone levels. They slowly increase from January onwards & peak in April (at levels of around 0.04 ppm). They then fall slightly & remain quite stable for the following few months. In the last couple months of the year, the average ozone levels of these counties again decline slightly to around 0.03 ppm or lower depending on the county.

In summary, counties along the non-southern coast appeared to exhibit the least variance in average ozone levels. Larger counties in the southern & central part of California saw the widest range of ozone levels in 2024. Generally, ozone levels peaked throughout the state in July & they were at their lowest in January.

In [29]:
#Create a geographic scatterplot of avg Ozone levels per 'County' & month in 2024 using 'Latitude's/'Longitude's
fig = px.scatter_geo(mth_cty_latlon_oz_avgs_II,
    lat='Site Latitude',
    lon='Site Longitude',
    color='Daily Max 8-hour Ozone Concentration',
    hover_name='County',
    animation_frame='Month Name',
    scope='usa',
    title='Average Ozone Concentration Levels Across California (2024)',
    color_continuous_scale='OrRd',
    #max avg Ozone level ~ 0.099
    range_color=[0,0.1],
    template='plotly_white')

#edit plot details
fig.update_layout(geo=dict(center=dict(lat=37.3, lon=-123)), title_x=0.5, width=950, height=470, margin=dict(t=35,b=30),
                 updatemenus=[{'buttons':[{'args':[None, {'frame':{'duration':1500}}]}]}])
fig.update_coloraxes(colorbar_title="Average 8-hour Ozone Concentration (ppm)")
fig.show()

The figure above is a geographic scatter plot displaying monthly average ozone levels per Californian county in 2024. The geographical data is broken down by longitude & latitude, so the dots on the map may shift around within counties. Given this, the range of average ozone levels varies a bit more than what was observed in the previous choropleth. Otherwise, this plot displays mostly the same information, the main difference being how location is displayed.

Recall that the EPA's daily eight-hour standard for "safe" ozone concentration is 0.07 ppm. On average, this value was reached or surpassed 57 times across six unique counties in 2024--San Bernardino (21), Riverside (13), Los Angeles (10), Tulare (7), Kern (4), & Fresno (2). All of these counties, which are indirectly adjacent to each other, are in the southern & central part of California. These above-standard ozone levels, which ranged from 0.070 to 0.099 ppm, were prevalent from June through September & most prominent in July--there were more than twice as many such data points in July as in any of the other three months. This is sensible given that average ozone levels peaked in July.
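Exceedance counts like these reduce to a simple threshold filter & per-county tally. A minimal sketch on a tiny synthetic table of county-month averages (the county names are real, but the values & column names are invented for illustration):

```python
import pandas as pd

# Synthetic county-month averages (county names real, ozone values invented)
avgs = pd.DataFrame({
    'County': ['San Bernardino', 'San Bernardino', 'Riverside', 'Humboldt'],
    'Month': [6, 7, 7, 7],
    'Avg Ozone': [0.081, 0.095, 0.074, 0.031],
})

EPA_STANDARD = 0.07  # EPA daily 8-hour ozone standard (ppm)

# Keep only county-month averages at or above the standard, then count per county
exceed = avgs[avgs['Avg Ozone'] >= EPA_STANDARD]
counts = exceed['County'].value_counts()
print(counts.to_dict())  # {'San Bernardino': 2, 'Riverside': 1}
```

Applied to the real monthly averages, the same filter yields the 57 exceedances described above.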

For the most part, the monthly patterns observed & described in the previous choropleth are also evident in this geographic scatter plot. The differences between the three main regions, as outlined below, were the most evident in June, July, & August of 2024.

  • The southern, central parts of California exhibit the most striking evolution in average ozone levels.
  • The northern counties exhibit a similar pattern in that their ozone levels increase from January through July then decline through the rest of the year, but these levels don't climb as dramatically as they do in the more southern counties.
  • Average ozone levels of counties along the northern & central parts of the west coast increase slightly between January & April, but they stabilize for most of the rest of the year before falling slightly in November & December.

As with the previous choropleth, the takeaway is that counties along the non-southern coast exhibited the least variance in average ozone levels, larger counties in the southern & central part of California saw the widest range of ozone levels in 2024, &, generally, ozone levels across the state peaked in July & bottomed out in January.

In [30]:
#Create a geospatial heatmap showing avg Ozone levels per month in 2024
fig = px.density_mapbox(
    mth_cty_latlon_oz_avgs_II,
    lat='Site Latitude',
    lon='Site Longitude',
    z='Daily Max 8-hour Ozone Concentration',
    radius=20,
    animation_frame='Month Name',
    center=dict(lat=37.3, lon=-123),
    zoom=3.6,
    mapbox_style="carto-positron",
    color_continuous_scale="OrRd",
    #max avg Ozone level ~ 0.099
    range_color=[0,0.1],
    title="Monthly Ozone Concentration Heatmap (2024)")

#edit plot details
fig.update_layout(title_x=0.5, updatemenus=[{'buttons':[{'args':[None, {'frame':{'duration':1500}}]}]}])
fig.update_coloraxes(colorbar_title="Average 8-hour Ozone Concentration (ppm)")
fig.show()

The geospatial heatmap above, like the previous figures, displays average ozone levels by latitude & longitude in 2024. The darker & redder the color, the higher the concentration of ozone. Note that not every area in California had available data in this plot.

In a way, this figure is more effective because the data is visualized in a form very similar to what is being plotted: ozone abundances. Rather than being definitively outlined into areas or counties, as in previous figures, this plot visualizes the data more akin to the way ozone actually behaves: indiscriminate & ignorant of human-made "lines on maps."

On the other hand, a drawback of this plot compared to the previous ones is that overlapping data points can make ozone levels appear more severe than they actually are. For example, in January, the most severe ozone levels appear to be 0.080 ppm or higher, but in actuality, the maximum average ozone level that month was about 0.044 ppm; the discrepancy is due to overlapping coordinates. Such a plot would be more effective if there were an equal number of data points for all pairs of coordinates in California, particularly if they were spread equidistantly across the entire state.

Nevertheless, the patterns described previously generally apply to this figure as well. For example, ozone levels peaked in June & July, whereas they were at their lowest in January & December of 2024. Other findings from this figure are listed below.

  • In January, in addition to the southern, central area of California, there is a small area further north in the central part of the state that exhibits elevated average ozone levels. These two general areas were considerably more striking in terms of color than anywhere else in the state.
  • These two areas are the most striking throughout the year in 2024 as the state follows the typical pattern described multiple times previously--increasing ozone levels from January through July & decreasing levels from then onwards.
  • In addition to these two primary areas, another region or two in between also become prominent in terms of average ozone levels. Other areas directly adjacent to these primary regions also exhibit increasing levels of ozone through the first half of the year & decreasing levels in the latter half.
  • On the other hand, areas in the northernmost & easternmost parts of the state, as well as areas directly on the coastline, only exhibit slight changes in ozone abundances relative to the primary regions.

In summary, ozone in California varied the most & was the most extreme in the southern & central parts of the state. Areas in the northern part of the state as well as on the coast exhibited the least variance in ozone levels.


As discussed at the beginning of the section, data points with the greatest & smallest concentrations of ozone across each of the 48 Californian counties could be obtained & used to determine which areas experienced the most & least variance in their ozone levels throughout 2024. This process was performed next.

In [31]:
#Retrieve min/max ozone levels per 'County'
cty_minmax_oz = data_II.groupby('County').agg({'Daily Max 8-hour Ozone Concentration':['min','max']}).reset_index()
#Fix the columns
cty_minmax_oz.columns = ['County','Min Ozone Level','Max Ozone Level']

#Append the corresponding dates of each data point - MINIMUM
cty_minmax_oz = cty_minmax_oz.merge(data_II[['New Date','Daily Max 8-hour Ozone Concentration','County']], 
                        left_on=['County','Min Ozone Level'], right_on=['County','Daily Max 8-hour Ozone Concentration']).drop(
                    columns=['Daily Max 8-hour Ozone Concentration']).rename(columns={'New Date':'Date - Min'})
#MAXIMUM
cty_minmax_oz = cty_minmax_oz.merge(data_II[['New Date','Daily Max 8-hour Ozone Concentration','County']], 
                        left_on=['County','Max Ozone Level'], right_on=['County','Daily Max 8-hour Ozone Concentration']).drop(
                    columns=['Daily Max 8-hour Ozone Concentration']).rename(columns={'New Date':'Date - Max'})
    #Note that there are 93 rows indicating that some counties (48) have multiple data points for min/max (or both) ozone levels

#Calculate difference between max/min ozone levels for each 'County'
cty_minmax_oz['Max-Min Ozone Level Difference'] = cty_minmax_oz['Max Ozone Level'] - cty_minmax_oz['Min Ozone Level']

#Create a variable indicating if a county's max ozone level exceeded the EPA's standard of 0.070 ppm
cty_minmax_oz["Max Ozone Exceeds EPA's Standard?"] = cty_minmax_oz['Max Ozone Level'].apply(
                                                lambda x: 'Yes' if x >= 0.07 else 'No')

#Reorganize columns
cty_minmax_oz = cty_minmax_oz[['County','Min Ozone Level','Date - Min','Max Ozone Level','Date - Max',
                               'Max-Min Ozone Level Difference', "Max Ozone Exceeds EPA's Standard?"]]

display(cty_minmax_oz.head(2))
County Min Ozone Level Date - Min Max Ozone Level Date - Max Max-Min Ozone Level Difference Max Ozone Exceeds EPA's Standard?
0 Alameda 0.006 2024-12-22 0.069 2024-10-01 0.063 No
1 Amador 0.010 2024-12-20 0.083 2024-07-03 0.073 Yes
In [32]:
#Plot 10 counties with largest & smallest difference between min-max ozone levels

#Need to remove duplicate county data points --> 48 rows
cty_minmax_oz_uniq = cty_minmax_oz.drop_duplicates(subset=['County'])

#ensure the two plots have the same vertical axis
fig, axes = plt.subplots(nrows=2, figsize=(9,5), sharey=True)

#SMALLEST MAX-MIN DIFFERENCES
ax1 = sns.barplot(data=cty_minmax_oz_uniq.nsmallest(10, 'Max-Min Ozone Level Difference').sort_values(
                'Max-Min Ozone Level Difference'), x='County', y='Max-Min Ozone Level Difference', ax=axes[0])
#LARGEST MAX-MIN DIFFERENCES
ax2 = sns.barplot(data=cty_minmax_oz_uniq.nlargest(10, 'Max-Min Ozone Level Difference').sort_values(
                'Max-Min Ozone Level Difference', ascending=False), x='County', y='Max-Min Ozone Level Difference', ax=axes[1])

#edit plot details
for k, v in {'10 California Counties with Smallest Difference in Average Monthly Ozone Levels (2024)': [0],
            '10 California Counties with Greatest Difference in Average Monthly Ozone Levels (2024)': [1]}.items():
    plt.sca(axes[v[0]])
    plt.title(k, fontsize=11.5)
    plt.xticks(fontsize=8)
    plt.ylabel('Diff Between Max & Min Avg Ozone Levels (ppm)', fontsize=6)
    plt.yticks(fontsize=7, ticks=np.arange(0,0.151, 0.030))
    plt.grid(axis='y',alpha=0.2,color='gray')
    plt.tight_layout();

The two plots above illustrate the Californian counties with the most & least extreme differences between their minimum & maximum average monthly ozone levels in 2024. The first plot lists the ten counties with the smallest difference, whereas the bottom plot displays the ten counties with the greatest difference. Recall that the dataset had data for 48 of the 58 Californian counties. Additionally, note that the EPA considers ozone levels of 0.07 ppm or higher to be unhealthy & potentially dangerous.

In 2024, the California counties that experienced the least variance in average ozone levels included Santa Cruz, San Francisco, Humboldt, Sonoma, & Lake counties. In other words, of the 48 Californian counties included in the data, these five exhibited the most stable ozone levels throughout 2024. The first four of these, & six of the ten counties in the first plot overall, lie along the western coast of the state. The other four counties--Lake, Colusa, Mariposa, & Yolo--lie adjacent to the western coast &/or in the central, slightly northern part of California. These patterns are consistent with findings reported earlier in that counties directly on the western coast or near the northern or central area of the state exhibited less variance & lower ozone levels throughout 2024 on average.

On the other hand, the counties of San Bernardino, Imperial, Riverside, Los Angeles, & Tulare experienced the greatest variance in average ozone levels in 2024. In other words, these ten counties in the second plot exhibited the least stable ozone levels throughout 2024. Seven of these ten counties are in the southernmost third of the state, & the other three--Fresno, Kings, & Tulare--are directly to the north. These findings are consistent with findings reported earlier in that most counties in southern & south-central regions of California experienced higher & relatively unstable ozone levels on average in 2024.

To help visualize where these counties are in the state, as well as how their average ozone levels developed over the course of 2024, another choropleth can be made that only distinguishes these 20 counties.

In [33]:
#Want to make a choropleth with these 20 counties illustrating monthly average ozone levels

#First, obtain the 20 counties
ctys_20_min_max = pd.concat([cty_minmax_oz_uniq.nsmallest(10, 'Max-Min Ozone Level Difference'), 
                   cty_minmax_oz_uniq.nlargest(10, 'Max-Min Ozone Level Difference')], axis=0)['County'].tolist()

#Want to retrieve the monthly ozone averages of these counties; however, for the choropleth, 
    #need to change the avg ozone levels of the other counties to 0 so that they show up on the figure but without any data.
    #Make these changes in the 'Imputed Ozone Conc.' variable.
#retrieve the monthly ozone avgs per county
ctys_20_mth_oz_avgs = mth_cty_oz_avgs_II.copy()
#Set the imputed ozone avg to 0 for rows where the 'County' is not among the 20 of relevance
ctys_20_mth_oz_avgs.loc[~ctys_20_mth_oz_avgs['County'].isin(ctys_20_min_max), 'Imputed Ozone Conc.'] = 0

#As a result, need to make corresponding adjustments to the 'Ozone Info' variable
ctys_20_mth_oz_avgs['Ozone Info'] = ctys_20_mth_oz_avgs.apply(lambda row: 
                            f"{row['County']}: {row['Imputed Ozone Conc.']:.3f} ppm" +
                            (" (County is not relevant — filled with 0)" if row['Imputed Ozone Conc.'] == 0 else ""), axis=1)

#Check -- GOOD
#display(ctys_20_mth_oz_avgs[(ctys_20_mth_oz_avgs['County'].isin(ctys_20_min_max)) & 
#        (ctys_20_mth_oz_avgs['Imputed Ozone Conc.'] != 0)])
    #when excluding ozone avgs of 0, have 239 rows (20 counties * 12 months = 240 rows)
        #the 1 missing row is because Sonoma did not have any data points in March 2024--see code below
        #display(clean_df[clean_df.County == 'Sonoma']['New Date'].dt.month.value_counts(dropna=False).sort_index())
In [34]:
#Make the figure
fig = px.choropleth(
    ctys_20_mth_oz_avgs,
    geojson=gdf.__geo_interface__,
    locations='County',
    featureidkey='properties.County',
    animation_frame='Month Name',
    hover_name='Ozone Info',
    color='Imputed Ozone Conc.',
    #max avg Ozone level ~ 0.082
    range_color=[0,0.085],
    color_continuous_scale="OrRd",
    title="Counties with Most/Least Extreme Monthly Average Ozone Levels (California, 2024)")

#edit plot details
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(title_x=0.5, width=900, height=500, updatemenus=[{'buttons':[{'args':[None, {'frame':{'duration':1400}}]}]}])
fig.update_coloraxes(colorbar_title="Average 8-hour Ozone Concentration (ppm)")
fig.show()

The figure above illustrates the 20 Californian counties from the previous bar plots. These 20 counties, split into two groups of ten, experienced the most & least variance in their average monthly ozone levels in 2024. This choropleth was made to help identify where these counties are geographically. As opposed to the previous bar plots, it displays the average monthly ozone levels exhibited by each county in 2024 rather than the differences between each county's largest & smallest monthly ozone levels.

In this figure, as the months go by, the contrast in ozone concentrations between the counties towards the southern part of the state & those directly on the western coast or towards the northern area is quite apparent. The southernmost counties, such as Los Angeles, San Bernardino, & Riverside, exhibit much more variance in their average ozone levels than those along the western coast (e.g. Monterey & Santa Cruz) or towards the northern part of the state (e.g. Humboldt & Lake counties). As previously stated, these findings are consistent with those from earlier in the section.


Analysis II - Conclusions¶

How does daily maximum 8-hour ozone concentration vary over time and regions?

The Analysis II section examined the data across counties to determine how ozone levels varied over time in California throughout 2024. There are 58 unique counties in California but only 48 had observations in the dataset. A variety of figures plus some arithmetic helped to uncover which counties & areas in California exhibited the greatest variance in ozone concentrations during 2024.

First, a general analysis of average monthly ozone levels across the entire state of California was performed. It was found that ozone levels peaked in June & July at around 0.050 parts per million (ppm) & then bottomed out in December & January at just over 0.030 ppm.
This trend is replicated in most of the Californian counties during 2024. Additionally, counties that saw higher maximum ozone levels typically exhibited greater variance in their ozone levels. At the beginning of the year, differences between average ozone levels were fairly similar across the 48 counties; however, these differences continued to widen through July, when most counties' ozone levels peaked in 2024. From then on, counties' average ozone levels generally declined, as did the gaps between them.

Through the handful of visuals presented in this section, it was found that average monthly ozone levels meaningfully differed across three general areas in California.

  • The southernmost & south-central part of the state exhibited the greatest variance in terms of average ozone concentrations. This area included counties like San Bernardino (the largest county in California by area), Los Angeles (the most populated county), Imperial, Kern, & Tulare.
  • Of the northern & north-central counties in California with data, these areas exhibited some considerable variance in their average monthly ozone levels through 2024, but not as pronounced as in the southern parts of the state. These areas included counties such as Shasta, Siskiyou, Inyo, & Sacramento.
  • Counties directly along the western coast, excluding those in the southernmost half of California, exhibited the lowest average ozone levels as well as the least variance. These regions included counties like Humboldt, Sonoma, Monterey, & Santa Cruz. Furthermore, average ozone levels in these counties deviated the most from the general trend pointed out previously that was replicated in the two aforementioned areas (see the two previous bullet points). In these northern counties along or near the western coast, average ozone levels increased slightly between January & April, then stabilized for most of the rest of the year before descending slightly in November & December.

In summary, areas towards the southern & central parts of California exhibited more variance in average ozone levels compared to the rest of the state. On the other hand, counties directly along the western coast as well as parts of northern California exhibited the least variance in ozone concentrations during 2024.


Analysis III¶

Are there any areas that consistently show high ozone concentrations? Do different methods report different ozone levels?

Since there are two questions here, this section answers them separately.

Analysis III-A¶

Are there any areas that consistently show high ozone concentrations?

If a geographic area exhibits high levels of ozone, it could indicate that it experiences significant amounts of air pollution, greater exposure to harmful radiation, or other factors that contribute to greater concentrations of ozone. More consistent ozone levels might indicate more stability in an area's atmosphere in terms of how healthy it is & how much air pollution there is. In totality, areas that exhibit high levels of ozone on a consistent basis are more prone to hazards like reduced visibility, increased levels of unhealthy or toxic gases in the air, respiratory problems, negative ecological consequences, & so on.
Identifying relevant areas in California can help estimate how many people are potentially at risk of such consequences. Additionally, it can point organizations & programs that actively study &/or work to combat high ozone levels (like the EPA) towards more vulnerable or affected areas & populations.

To determine which areas of California consistently had high levels of ozone in 2024, multiple metrics can be used. Firstly, the variance--the spread of values around the mean--can be used to evaluate which counties had the most stable ozone levels throughout the year or parts of it. The variance in daily eight-hour ozone levels across the entire state was about 0.00021 in 2024. To find areas with high concentrations of ozone, the mean can be used to distinguish data points with above-average values. Across 2024, the average daily eight-hour ozone level in California was about 0.044 parts per million (ppm).

In Analysis II, counties were used to evaluate ozone levels throughout California. For the sake of consistency, this section follows the same practice. To evaluate consistency, the data can be analyzed over time. Analysis II included a monthly analysis, an increment that can be used again here. A period of roughly 30 days seems appropriate for evaluating consistency in this case: it encompasses a representative number of data points without being so large that the analysis becomes diluted & leans too heavily towards the central tendencies of the values.
With this information, counties that exhibit high values of the 'Daily Max 8-hour Ozone Concentration' variable with low monthly variances are those of greatest interest in this section. This process of evaluating consistency could be expanded across consecutive months to determine the stability of counties' ozone levels over longer periods of time. This could be done by comparing counties' variances of ozone levels between months.
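The per-county-month means & variances described above reduce to a single groupby-aggregate. A minimal sketch on a tiny synthetic sample of daily readings (the county names are real, but the values & the 'Ozone' column name are invented stand-ins for the dataset's):

```python
import pandas as pd

# Tiny synthetic sample of daily readings (county names real, values invented)
daily = pd.DataFrame({
    'County': ['Inyo'] * 4 + ['Kern'] * 4,
    'Month': [1, 1, 2, 2] * 2,
    'Ozone': [0.040, 0.042, 0.044, 0.046, 0.030, 0.050, 0.020, 0.080],
})

# Mean & variance of daily ozone per county-month pair, as in the analysis
stats = daily.groupby(['County', 'Month'])['Ozone'].agg(['mean', 'var']).reset_index()

# Kern's readings swing far more than Inyo's, so its monthly variances are larger
print(stats)
```

Counties combining a high 'mean' with a low 'var' are precisely those of interest in this section.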

Furthermore, Analysis II utilized multiple figures to visualize how average monthly ozone levels varied by county throughout 2024 in California. A similar approach can be used here to display areas in the state with consistently high levels of ozone. Recall that latitudes & longitudes are needed for mapping locations onto some of them, including geographic scatter plots & geospatial heatmaps.

To start, the mean & variance of the 'Daily Max 8-hour Ozone Concentration' variable were calculated for each of the 58 counties in California. With these statistics, the counties that most consistently show high levels of ozone can be determined.
Recall that only 48 counties are in the dataset, so the other ten have to be attached. Additionally, there are about 2,700 data points missing values for the 'Daily Max 8-hour Ozone Concentration' variable, which is about five percent of the size of the cleaned dataset (clean_df). These observations were removed from subsequent analyses.

In [35]:
#Filter the data for the relevant variables, remove observations with missing Ozone data
data_III = clean_df.dropna(subset=['Daily Max 8-hour Ozone Concentration'])[['New Date',
                'Daily Max 8-hour Ozone Concentration','Method Code','CBSA Code','CBSA Name','County FIPS Code','County',
                                                                        'Site Latitude','Site Longitude']]
#Create variables for the month & quarter of each observation
data_III['Month'] = data_III['New Date'].dt.month
data_III['Quarter'] = data_III['New Date'].dt.quarter
#display(data_III.sample(2))
data_III.info()
<class 'pandas.core.frame.DataFrame'>
Index: 51309 entries, 0 to 54755
Data columns (total 11 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   New Date                              51309 non-null  datetime64[ns]
 1   Daily Max 8-hour Ozone Concentration  51309 non-null  float64       
 2   Method Code                           45295 non-null  float64       
 3   CBSA Code                             49057 non-null  float64       
 4   CBSA Name                             49057 non-null  object        
 5   County FIPS Code                      51309 non-null  int64         
 6   County                                51309 non-null  object        
 7   Site Latitude                         51309 non-null  float64       
 8   Site Longitude                        51309 non-null  float64       
 9   Month                                 51309 non-null  int32         
 10  Quarter                               51309 non-null  int32         
dtypes: datetime64[ns](1), float64(5), int32(2), int64(1), object(2)
memory usage: 4.3+ MB
In [36]:
#Calculate mean & variance of ozone levels per 'County', 'Month'
mth_cty_oz_avgs_var_I = data_III.groupby(['Month','County']).agg({'Daily Max 8-hour Ozone Concentration':['mean',
                                                        'var']}).reset_index()
    #this data has 575 month-county pairs because some counties don't have data; 58 CA counties * 12 months = 696 pairs
#Fix the columns
mth_cty_oz_avgs_var_I.columns = ['Month','County','Avg Ozone','Ozone Variance']

#Merge all CA-county-month pairs w/avgs & variances to include pairs with no data -- 696 pairs, GOOD
mth_cty_oz_avgs_var_II = mth_cty_oz_avgs_var_I.merge(mth_CA_cty_pairs, on=['Month','County'], 
                                    how='outer').sort_values(['Month','County']).reset_index(drop=True)

#Attach the GeoJSON data template data
mth_cty_oz_avgs_var_II = mth_cty_oz_avgs_var_II.set_index('County').join(gdf.set_index('County'), on='County', 
                                    how='left').sort_values(['Month','County']).reset_index()
#Add a variable that imputes missing Ozone concentrations with 0
mth_cty_oz_avgs_var_II['Imputed Ozone Conc.'] = mth_cty_oz_avgs_var_II['Avg Ozone'].fillna(0)
#Create a variable for hover_info in the plotly.express figures to indicate whether a county had missing Ozone data or not
mth_cty_oz_avgs_var_II['Ozone Info'] = mth_cty_oz_avgs_var_II.apply(lambda row: 
                            f"{row['County']}: {row['Imputed Ozone Conc.']:.3f} ppm" +
                            (" (No data — filled with 0)" if row['Imputed Ozone Conc.'] == 0 else ""), axis=1)

#Create another variable indicating the 'Month' name instead of number
mth_cty_oz_avgs_var_II['Month Name'] = pd.to_datetime(mth_cty_oz_avgs_var_II['Month'], format='%m').dt.month_name()
In [37]:
#Calculate avg ozone level per county using the avg ozone levels per CA county-month pair
avg_oz_cty = mth_cty_oz_avgs_var_II.groupby('County')['Avg Ozone'].mean().reset_index()
    #mean ozone level across all counties ~ 0.04091536549742834
#Calculate avg variance of ozone levels per county using the ozone variances per CA county-month-pair
avg_oz_var_cty = mth_cty_oz_avgs_var_II.groupby('County')['Ozone Variance'].mean().reset_index()
    #mean monthly variance in ozone across all counties ~ 0.00006420562336263973

#Apply ranks to each of these 2 variables where 1st is for the smallest value. Recall that 10/58 counties have NaN's.
    #Need to account for ties. May need to round values to make sure ranks are equal when values are equal.
    #Want 1st in Avg Ozone to be the county with the highest avg ozone levels.
    #Want 1st in Ozone Variance to be the county with the most consistent ozone levels; least variance
avg_oz_cty['Avg Ozone Rank'] = avg_oz_cty['Avg Ozone'].round(6).rank(ascending=False, method='min')
avg_oz_var_cty['Ozone Variance Rank'] = avg_oz_var_cty['Ozone Variance'].round(6).rank(ascending=True, method='min')

#Merge the mean & variance data, rename columns
avg_var_oz_cty = avg_oz_cty.merge(avg_oz_var_cty, on='County').rename(columns={'Ozone Variance':'Avg Ozone Variance'})


#Find # counties whose avg ozone level (over 12 months) is greater than the whole 12-month avg across all counties
#avg_var_oz_cty[avg_var_oz_cty['Avg Ozone'] > avg_var_oz_cty['Avg Ozone'].mean()]
    #58 -> 25 counties
#Find # counties whose avg ozone variance (over 12 months) is less than the whole 12-month avg variance across all counties
#avg_var_oz_cty[avg_var_oz_cty['Avg Ozone Variance'] < avg_var_oz_cty['Avg Ozone Variance'].mean()]
    #58 -> 28 counties

#Find # counties whose avg ozone level & variance (over 12 months) is greater & less than (respectively) the whole 12-month...
    # avg & avg variance across all counties respectively
cons_high_oz_ctys = avg_var_oz_cty[(avg_var_oz_cty['Avg Ozone'] > avg_var_oz_cty['Avg Ozone'].mean()) & 
              (avg_var_oz_cty['Avg Ozone Variance'] < avg_var_oz_cty['Avg Ozone Variance'].mean())]
    #58 counties -> 10 counties
display(cons_high_oz_ctys.sort_values('Avg Ozone Variance', ascending=True))
    #these 10 counties are in the top 24 (half) in avg ozone, & the top 28 in avg variance in ozone

#Retrieve these 10 counties
ten_cons_high_oz_ctys = cons_high_oz_ctys['County'].tolist()
County Avg Ozone Avg Ozone Rank Avg Ozone Variance Ozone Variance Rank
3 Butte 0.042969 21.0 0.000042 9.0
21 Mariposa 0.047290 7.0 0.000044 12.0
13 Inyo 0.044712 13.0 0.000050 15.0
54 Tuolumne 0.043846 18.0 0.000051 16.0
50 Sutter 0.042662 23.0 0.000052 17.0
55 Ventura 0.044148 17.0 0.000054 18.0
8 El Dorado 0.044419 15.0 0.000054 18.0
2 Amador 0.043582 19.0 0.000055 20.0
28 Nevada 0.043001 20.0 0.000056 22.0
44 Shasta 0.042517 24.0 0.000064 28.0
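The tie handling in the ranking above (rounding, then rank(method='min')) is worth isolating: note how Ventura & El Dorado, with equal variances of 0.000054, share rank 18 & rank 19 is skipped. The toy series below reproduces that behavior with made-up values:

```python
import pandas as pd

# Toy variances; the middle two values tie, mirroring the tie seen in the table
s = pd.Series([0.000042, 0.000054, 0.000054, 0.000064])

# Rounding first guards against float noise splitting near-equal values;
# method='min' gives tied values the same (lowest) rank & skips the next rank
ranks = s.round(6).rank(ascending=True, method='min')
print(ranks.tolist())  # [1.0, 2.0, 2.0, 4.0]
```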
In [38]:
#Visualize relationship between avg ozone levels & avg monthly variance in ozone levels per county in 2024

ax1 = sns.scatterplot(data=avg_var_oz_cty, x='Avg Ozone', y='Avg Ozone Variance')
#indicate avg ozone level across all counties
plt.axvline(avg_var_oz_cty['Avg Ozone'].mean(), color='orange', linestyle='--', alpha=0.6)
#indicate avg monthly variance in ozone levels across all counties
plt.axhline(avg_var_oz_cty['Avg Ozone Variance'].mean(), color='orange', linestyle='--', alpha=0.6)

#edit plot details
plt.title('Average Variance vs Average Ozone Levels per CA County - 2024', fontsize=11)
plt.xlabel('Avg Ozone Level (ppm)', fontsize=10)
plt.xticks(fontsize=8.5)
plt.ylabel('Avg Monthly Variance in Ozone Levels', fontsize=10)
plt.yticks(fontsize=8.5)
#label each quadrant, accentuate the quadrant of interest
for k, v in {'Inconsistent, Low Ozone Levels':[0.029, 0.000108, 'normal'], 
             'Inconsistent, High Ozone Levels':[0.045, 0.000108, 'normal'],
             'Consistent, Low Ozone Levels':[0.029, 0.00003, 'normal'], 
             'Consistent, High Ozone Levels':[0.045, 0.00003, 'bold']}.items():
    plt.annotate(k, xy=(v[0],v[1]), fontsize=7, weight=v[2])
plt.tight_layout();

The figure above illustrates the average amount of monthly variance in a California county's ozone levels against their average ozone levels in 2024. There are 58 counties in California but only 48 had observations in the dataset. Note that the horizontal axis is in units of parts per million (ppm). The two dashed lines indicate the average value across all counties for each variable, which splits the figure into quadrants.

Each of the four sections is labeled with an easily understandable interpretation of the data. For example, in the lower right quadrant, the "Consistent, High Ozone Levels" label indicates that data points in this quadrant have above average ozone levels & below average ozone variances. In other words, these counties, on average, have relatively high ozone levels that are more consistent or less varied.

Notice how the data points appear to follow a fairly linear trend indicating that the greater the average ozone level of a county, the greater the variance in a county's typical ozone levels. This finding is reinforced by the fact that there are more counties in the lower left & upper right quadrants (18 & 15 counties respectively) than in the upper left & lower right quadrants (five & ten counties respectively).
Recall that the counties of interest in this section are those that consistently show high ozone concentrations. These areas correspond to the lower right quadrant which is distinguished with the bolded font.

A consequence of this generally linear relationship between the two axes is that the counties in the quadrant of interest do not correspond to the most extreme data points on either variable. More specifically, most counties in the lower-right quadrant do not have the highest average ozone levels across California, nor do they have the least variance in average monthly ozone levels. Nevertheless, these counties best qualify as those that consistently show high ozone concentrations. These Californian counties, which can be referenced via the ten_cons_high_oz_ctys variable, include:

  • Amador, Butte, El Dorado, Inyo, Mariposa, Nevada, Shasta, Sutter, Tuolumne, & Ventura.
  • More specifically, the Mariposa, Inyo, & El Dorado counties have the highest average ozone levels of these ten counties. Moreover, the Butte, Mariposa, & Inyo counties have the most consistent average ozone levels of these ten counties. As such, it can be said that the Mariposa & Inyo counties in particular exhibited the highest & most consistent ozone concentrations in 2024.

Of course, the quadrant boundaries could be relaxed slightly to include more counties, but those counties would no longer have both above-average ozone levels & below-average ozone variances across the 48 Californian counties with data. As such, these ten counties were retained & used moving forward. Next, the ozone levels of these counties can be visualized on a choropleth like in Analysis II.
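The quadrant filtering described above can be sketched as a pair of comparisons against the cross-county means (a minimal sketch with synthetic data; the column names mirror the notebook's avg_var_oz_cty frame):

```python
import pandas as pd

# Synthetic stand-in for the per-county summary frame (avg_var_oz_cty in the notebook)
df = pd.DataFrame({
    'County': ['A', 'B', 'C', 'D'],
    'Avg Ozone': [0.050, 0.040, 0.052, 0.038],
    'Avg Ozone Variance': [0.00007, 0.00006, 0.00012, 0.00011],
})

# Quadrant thresholds: the cross-county mean of each metric (the dashed lines)
oz_mean = df['Avg Ozone'].mean()
var_mean = df['Avg Ozone Variance'].mean()

# "Consistent, high ozone" = above-average level AND below-average variance
consistent_high = df[(df['Avg Ozone'] > oz_mean) & (df['Avg Ozone Variance'] < var_mean)]
print(consistent_high['County'].tolist())  # only county 'A' lands in the lower-right quadrant
```

The same boolean-mask pattern extends directly to the 48-county frame used in the notebook.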

In [39]:
#Want to retrieve the monthly ozone averages of these 10 counties; however, for a choropleth, need to change the 
    #avg ozone levels of the other counties to 0's so that they show up on the figure without any data.
    #Make these changes in the 'Imputed Ozone Conc.' variable.

#Retrieve the monthly ozone avgs per county
mth_cty_oz_avgs_var_10 = mth_cty_oz_avgs_var_II.copy()

#Zero out the imputed ozone avg for counties not among the 10 of interest (vectorized; no loop needed)
not_relevant = ~mth_cty_oz_avgs_var_10['County'].isin(ten_cons_high_oz_ctys)
mth_cty_oz_avgs_var_10.loc[not_relevant, 'Imputed Ozone Conc.'] = 0

#As a result, need to make corresponding adjustments to the 'Ozone Info' variable
mth_cty_oz_avgs_var_10['Ozone Info'] = mth_cty_oz_avgs_var_10.apply(lambda row: 
                            f"{row['County']}: {row['Imputed Ozone Conc.']:.3f} ppm" +
                            (" (County is not relevant — filled with 0)" if row['Imputed Ozone Conc.'] == 0 else ""), axis=1)
In [40]:
#Make a line plot of the 10 CA-counties with "highest, most consistent" avg ozone levels in 2024
plt.figure(figsize=(9,4))

ax1 = sns.lineplot(data=mth_cty_oz_avgs_var_10[mth_cty_oz_avgs_var_10.County.isin(ten_cons_high_oz_ctys)], 
                   x='Month Name', y='Imputed Ozone Conc.', hue='County', marker='.', alpha=0.8)

#edit plot details
plt.title('Counties with the Highest, Most Consistent Ozone Concentrations (California, 2024)')
plt.xlabel('Month')
plt.xticks(fontsize=8.5)
plt.ylabel('Average Ozone Concentration (ppm)', fontsize=9.5)
plt.yticks(fontsize=8)
ax1.set(ylim=(0,0.07))
plt.legend(loc='lower center', ncols=2, title='County', title_fontsize=9.5, fontsize=7.5)
plt.grid(axis='y',alpha=0.2,color='gray')
plt.tight_layout();

The figure above displays the 2024 monthly average ozone levels for the ten Californian counties deemed to have the highest, most consistent ozone concentrations. While these counties didn't have the outright highest nor the most consistent ozone levels, they had the highest & most consistent levels relatively speaking. Note that the vertical axis is in units of parts per million (ppm). For reference, the average ozone level in the state in 2024 was about 0.044 ppm.

Recall the general trend of average monthly ozone levels in California in 2024. Ozone levels were at their lowest in December & January, they increased through the early summer (i.e. July), & they then declined through the rest of the year. These trends are generally reflected in the plot above for these ten counties with some small inconsistencies here & there; e.g. Nevada county's dip in ozone levels from March to April.

Among these ten counties, the following conclusions can be made.

  • The Mariposa, Inyo, & El Dorado counties have the highest average monthly ozone levels.
  • The Butte, Mariposa, & Inyo counties have the most consistent average monthly ozone levels.
  • Across both metrics, the Mariposa, Inyo, El Dorado, Ventura, & Tuolumne counties exhibited ozone levels in 2024 that were the greatest & most consistent.
  • Amador has the largest single monthly average ozone level (about 0.062 ppm in July), but the average monthly variance in its ozone levels is the third greatest of the ten counties.

To help quantify & better understand the level of consistency & magnitude of these counties' ozone levels, they can be compared to those of counties not deemed to have consistent, high ozone concentrations in 2024. More specifically, data points with inconsistent, high ozone concentrations (those in the upper right quadrant of the earlier scatter plot) can be evaluated against those in the plot above.

In [41]:
#Identify the CA-counties with the greatest monthly 'Avg Ozone' & 'Avg Ozone Variance' levels
#(thresholds are the cross-county averages of each metric, i.e. the dashed lines in the quadrant figure)
high_avg_var_oz_data = avg_var_oz_cty[(avg_var_oz_cty['Avg Ozone'] > 0.044) & (avg_var_oz_cty['Avg Ozone Variance'] > 0.000083)]
high_avg_var_oz_ctys_7 = high_avg_var_oz_data.County.tolist()
In [42]:
#Make a line plot of the CA-counties with "highest, most/least consistent" avg ozone levels in 2024
fig, axes = plt.subplots(figsize=(11,5), ncols=2, sharey=True)

#10 CA-counties with "highest, most consistent" avg ozone levels in 2024
ax1 = sns.lineplot(data=mth_cty_oz_avgs_var_10[mth_cty_oz_avgs_var_10.County.isin(ten_cons_high_oz_ctys)], 
                   x='Month Name', y='Imputed Ozone Conc.', hue='County', marker='.', alpha=0.8, ax=axes[0])
#7 CA-counties with "highest, least consistent" avg ozone levels in 2024
ax2 = sns.lineplot(data=mth_cty_oz_avgs_var_II[mth_cty_oz_avgs_var_II.County.isin(high_avg_var_oz_ctys_7)], 
                   x='Month Name', y='Imputed Ozone Conc.', hue='County', marker='.', alpha=0.8, ax=axes[1])

#edit plot details
for k, v in {'CA Counties with the Highest, Most Consistent Ozone Levels (2024)':[0], 
            'CA Counties with the Highest, Least Consistent Ozone Levels (2024)':[1]}.items():
    plt.sca(axes[v[0]])
    plt.title(k, fontsize=10)
    plt.xlabel('Month')
    plt.xticks(fontsize=7.5, rotation=20)
    plt.ylabel('Average Ozone Concentration (ppm)', fontsize=9.5)
    plt.yticks(fontsize=8)
    axes[v[0]].set(ylim=(0,0.085))
    plt.legend(loc='lower center', ncols=2, title='County', title_fontsize=9.5, fontsize=7.5)
    plt.grid(axis='y',alpha=0.2,color='gray')
    plt.tight_layout();

Of the two figures directly above, the one on the left is the same as the previous figure. The second plot displays the same general data but for seven other Californian counties, which were deemed those with the highest & least consistent ozone levels in 2024. These two groups of counties were gathered & visualized against one another in order to better understand the magnitudes & levels of consistency in their ozone levels.

Comparing these two groups makes clear why ozone levels were much more consistent for the ten counties on the left than for the seven on the right: between April & October, the average monthly ozone levels of the seven counties on the right climbed considerably higher than those of the ten counties on the left. In December & January, when ozone levels were less severe, levels were relatively similar across all 17 of these counties, but in the summer months, when ozone levels usually peaked in California in 2024, the disparity in average ozone levels between these two groups grew substantially.

To quantify the differences in these changes in ozone levels, consider the following.

  • Between January & July, the monthly average ozone levels of the ten counties on the left generally increased by roughly 0.015-0.030 ppm.
  • Between January & July, the monthly average ozone levels of the seven counties on the right generally increased by roughly 0.035-0.045 ppm, excluding San Diego.
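The January-to-July increases quoted above boil down to a per-county difference of the July & January means; this can be sketched with a pivot (synthetic data; column names are illustrative stand-ins for the notebook's monthly frames):

```python
import pandas as pd

# Synthetic monthly county averages (mirroring 'County'/'Month'/'Imputed Ozone Conc.')
df = pd.DataFrame({
    'County': ['A', 'A', 'B', 'B'],
    'Month': [1, 7, 1, 7],
    'Imputed Ozone Conc.': [0.030, 0.055, 0.028, 0.065],
})

# Pivot so each county is a row with its January (1) & July (7) averages as columns
wide = df.pivot(index='County', columns='Month', values='Imputed Ozone Conc.')

# Seasonal rise = July average minus January average, per county
rise = (wide[7] - wide[1]).round(3)
print(rise.to_dict())  # county A rose ~0.025 ppm, county B ~0.037 ppm
```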

Ultimately, these figures serve to illustrate that while the ten counties on the left didn't have the most severe ozone levels in California outright, they did have the highest AND most consistent ozone levels in 2024 of the 48 counties with data points.

On another note, the monthly average ozone levels of San Diego (seen on the right-side plot) between April & October are a bit contradictory to the patterns seen throughout the project. Instead of increasing through July & then decreasing thereafter, the average ozone levels in San Diego in 2024 actually declined from April through July before climbing between August & October.
This county could be worth additional investigation in a separate analysis to drill down into these unusual observations.

Next, these ten counties with "high, consistent" ozone levels were visualized geographically to understand where these areas are in the state of California.

In [43]:
#Create a choropleth of the 10 CA-counties with "highest, most consistent" avg Ozone levels in 2024
fig = px.choropleth(
    mth_cty_oz_avgs_var_10,
    geojson=gdf.__geo_interface__,
    locations='County',
    featureidkey='properties.County',
    animation_frame='Month Name',
    hover_name='Ozone Info',
    color='Imputed Ozone Conc.',
    #max avg Ozone level ~ 0.063
    range_color=[0,0.065],
    color_continuous_scale="OrRd",
    title="Counties with the Highest, Most Consistent Ozone Concentrations (California, 2024)")

#edit plot details
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(title_x=0.5, width=900, height=500,
                  updatemenus=[{'type':'buttons', 'buttons':[{'label':'Play', 'method':'animate',
                                                              'args':[None, {'frame':{'duration':1400}}]}]}])
fig.update_coloraxes(colorbar_title="Average 8-hour Ozone Concentration (ppm)")
fig.show()

The figure above consists of an animated choropleth which illustrates the monthly average ozone concentrations in 2024 across California. Although there are 58 counties in California, only 48 of them had observations in the dataset. Moreover, only ten of these counties contain data in this choropleth because they were identified as those that exhibited the highest, most consistent ozone concentrations. It should be noted that while these ten counties did not have the most extreme data points in terms of the highest or the most consistent ozone levels, their values across both metrics were the most prominent with regard to the question of interest.
Recall the typical trend of average ozone levels in California throughout 2024: ozone levels were at their lowest in December & January, they increased through the early summer (i.e. July), & they then declined through the rest of the year. In the animation, this pattern is generally observed throughout 2024 across most of the ten counties.

From the previous figure, the Mariposa, Inyo, El Dorado, & Ventura counties most significantly exhibited ozone levels that were both relatively large & consistent across the Californian counties. Geographically, the first three of these counties are in the central & eastern part of the state, whereas Ventura is the lone county (of the ten) in the southern & western part of California. Moreover, these four counties are the four southernmost of the ten with data in the figure.

Note that this latter finding doesn't indicate that more southern Californian counties experienced higher, more consistent ozone levels in 2024 than those farther north; however, it could suggest that more southern counties were more likely to exhibit higher & less varying ozone levels throughout the year.
This hypothesis could be evaluated further by finding the areas in the state that exhibited the lowest & least consistent ozone levels.

  • A supporting factor for this hypothesis is that ozone levels tend to be higher in areas closer to the equator. California is in the northern hemisphere, so more southern parts of the state are more likely to experience greater temperatures because regions nearer the equator receive more sunlight throughout the year. Ultimately, more sunlight & higher temperatures accelerate the photochemical reactions that produce ground-level ozone, which can contribute to higher ozone concentrations.
    • This finding is reinforced by a conclusion made in Analysis II which found that in 2024, ozone levels were typically higher in many of the more southern counties in California than they were in central & more northern counties.

Ultimately, this figure exemplifies that there is no one general area in California which, in 2024, experienced higher & more consistent ozone concentrations than the rest of the state. Instead, these qualities appear in multiple smaller, scattered parts of the state that do not form one cohesive region. Moreover, counties adjacent to these counties of interest often exhibited conflicting observations.
As such, there are no general areas in California that consistently had high ozone concentrations in 2024. Nonetheless, the counties that most embodied these characteristics include:

  • Mariposa, Inyo, El Dorado, Ventura, & Tuolumne

Analysis III-A Conclusions¶

Are there any areas that consistently show high ozone concentrations?

Throughout this section, two main metrics were used to evaluate this prompt because it inquired about two unique elements of ozone concentrations. Importantly, this section sought areas in California that exhibited both (not one or the other) consistency & large magnitudes of ozone levels. To determine which areas had higher ozone levels, averages were used, whereas for consistencies of ozone levels, the metric of variance was used. As in Analysis II, counties were used to evaluate ozone levels on a geographic basis. Additionally, ozone levels of the counties were analyzed on a monthly basis to be able to evaluate consistency & how much they evolved.

When evaluating these two metrics--monthly average ozone levels & their variances--across the 48 Californian counties, it was found that they generally have a linear relationship. In other words, counties that exhibited higher monthly average ozone levels typically had a greater variance in their average ozone levels & vice versa. As such, counties with consistent and high ozone concentrations did not necessarily have the highest nor the most consistent ozone levels.

Nevertheless, the counties that most qualified for these two characteristics were obtained by filtering the data for counties with above average monthly ozone levels & below average variances in ozone concentrations. This process resulted in the following ten counties: Amador, Butte, El Dorado, Inyo, Mariposa, Nevada, Shasta, Sutter, Tuolumne, & Ventura.

  • Across both metrics, the Mariposa, Inyo, El Dorado, Ventura, & Tuolumne counties exhibited ozone levels in 2024 that were the most severe & the most consistent.

Additionally, the monthly average ozone levels of these counties were compared to those of the counties with the most severe & least consistent ozone levels to illustrate the differences in variance of their ozone concentrations throughout 2024.
Recall that the EPA considers ozone levels of 0.07 ppm or higher to be unhealthy & potentially dangerous. The average monthly ozone levels of these ten most relevant counties peaked at about 0.06 ppm in July, so these areas were not in imminent danger in 2024; however, with increasing levels of climate change & warming temperatures, this could change soon.


Analysis III-B¶

Do different methods report different ozone levels?

In the cleaned version of the dataset, there are four unique collection methods which are labeled as: 47, 53, 87, & 199. There is no additional information as to the differences or characteristics of these identifiers in the dataset. Moreover, about 6,300 observations (about 11.6%) in the dataset do not list a distinct collection method which is too large a portion to simply remove. In Analysis I, these missing values were not imputed because there wasn't any clear way to do so.
As such, instead of removing or ignoring these observations, they can be considered as simply another category & compared to the observations with a listed collection method.

As in prior analyses in the project, similar metrics, processes, & visuals can be used here. For example, the average monthly ozone concentration of each collection method can be calculated, plotted, & compared in order to determine if & how they differ.

To start, the number of observations associated with each method code can be determined to better understand the distribution of data points. Moreover, if this analysis plans to evaluate this variable on a monthly basis, it may be informative to also check the number of observations for each method code per month. For instance, this could reveal if certain method codes did or did not have observations at certain points in 2024.

In [44]:
#First, check distribution of data points per 'Method Code' & include missing observations
#create a new version of the dataset
data_III_B = clean_df.copy()
#fill the missing values with "Missing"
data_III_B = data_III_B.fillna({'Method Code':'Missing'})

#Plot the distribution
ax1 = sns.countplot(data=data_III_B, x='Method Code', order=[87.0,47.0,'Missing',199.0,53.0])

#edit plot details
plt.title('Number of Data points per Method Code')
#value_counts is sorted by descending count, which matches the explicit `order` above
ax1.bar_label(ax1.containers[-1], labels=[f'{num}' for num in data_III_B['Method Code'].value_counts()], fontsize=8.5)
plt.xticks(fontsize=9)
plt.yticks(fontsize=9)
plt.grid(axis='y',alpha=0.2,color='gray')
plt.tight_layout();

The figure above demonstrates the number of observations in the dataset (clean_df) associated with each 'Method Code.' The "Missing" category, which accounts for about 11.6% of the dataset, denotes observations that did not have a listed collection method. Note that these numbers reflect the total number of observations associated with each collection method; however, there are about 2,700 observations in the dataset that do not have a listed ozone reading.
From this figure, it is clear that most of the observations were measured using the collection method "87" (about 65%). The least popular method, "53," accounts for less than one percent of the data points. As such, observations with the "53" collection method may be more likely to be affected by outliers because there are fewer "normal" data points that can dilute outliers.

For the purposes of this section, this distribution doesn't necessarily have any direct implications, but it can be helpful to have in mind should the ozone levels associated with the collection method "53" be significantly different than any of the other methods. Although 386 observations is a very small subset of the 54,000-odd rows in the dataset, it is still a sizable enough portion from which representative conclusions can be made.
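To put the small-sample caveat in perspective, the standard error of a mean scales with 1/√n, so the "53" mean is inherently less precise than the "87" mean for a comparable underlying spread (a back-of-the-envelope sketch using the counts from the figure above; not part of the original notebook):

```python
import math

# Sample sizes per method code, taken from the count plot above
n_87, n_53 = 34890, 386

# For a common standard deviation s, SE = s / sqrt(n), so the ratio of
# standard errors depends only on the ratio of sample sizes
se_ratio = math.sqrt(n_87 / n_53)
print(round(se_ratio, 1))  # ≈ 9.5: the "53" mean is roughly 9.5x less precise
```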

Additionally, this same process was repeated, but broken out on a monthly basis.

In [45]:
#Check # data points per 'Method Code' & month
num_meth_mth_A = data_III_B.groupby([data_III_B['New Date'].dt.month, 'Method Code']).size().reset_index().rename(
            columns={'New Date':'Month', 0:'# data points'})
#Need to make a template containing all pairs of months & method code bc some don't have data; 12 months * 5 codes = 60 rows
meth_mth_template = pd.DataFrame(list(itertools.product(list(range(1,13,1)), 
                                    data_III_B['Method Code'].unique().tolist()))).rename(columns={0:'Month',1:'Method Code'})
#Merge the data & template to get all method-month pairs
num_meth_mth_B = meth_mth_template.merge(num_meth_mth_A, on=['Month','Method Code'], how='outer').fillna(0)
display(num_meth_mth_B.sort_values('# data points').head(13))


#Plot the data
ax1 = sns.lineplot(data=num_meth_mth_A, x='Month', y='# data points', hue='Method Code', marker='.', alpha=0.85)

#edit plot details
plt.title('Number of Monthly Observations per Collection Method (California 2024)')
plt.xticks(ticks=range(1,13,1), fontsize=9)
plt.ylabel('# Observations')
plt.yticks(fontsize=9)
plt.legend(loc='center right', title='Method Code', title_fontsize=9, fontsize=7.5)
plt.grid(axis='y',alpha=0.2,color='gray')
plt.tight_layout();
    Month Method Code  # data points
4       1        53.0            0.0
14      3        53.0            0.0
9       2        53.0            0.0
49     10        53.0           24.0
59     12        53.0           27.0
54     11        53.0           27.0
44      9        53.0           30.0
39      8        53.0           41.0
29      6        53.0           55.0
19      4        53.0           59.0
24      5        53.0           60.0
34      7        53.0           63.0
6       2     Missing          192.0

The figure above, similar to the previous one, illustrates the number of observations associated with each method code (including those without a listed method code) per month in 2024. There were four unique method codes plus the category encompassing observations without a listed collection method, labeled "Missing." Note that these numbers reflect the total number of observations associated with each method code; however, there are about 2,700 observations in the dataset that do not have a listed ozone reading. Clearly, the "87" method accounted for a significant majority of the data points whereas the "53" method accounted for the fewest.

The "53" collection method has between zero & 63 data points per month in 2024, whereas the other three method codes have at least 292 observations per month. This brings into question the credibility of observations associated with the "53" collection method. Ultimately, no direct adjustments need to be made to the data associated with this code, but such a small number of data points could have repercussions in subsequent analyses.
Some other notes regarding this figure are listed below.

  • The collection method "87" is by far the most popular one across 2024, as more than 2,500 data points were measured with it each month.
  • The collection method "53" did not have data for the first three months of the year. The cause behind this is unknown but perhaps it encountered some issues that prevented it from being usable during these months, or it might have been a brand new method that wasn't ready to be used until April. Ultimately, this method is quite uncommon compared to the others.
  • The "47" collection method dropped in its usage rate by about half over the first few months in 2024.
  • The number of observations without a listed collection method grew by about 200% by the end of the year. Also, three of the four collection methods had fewer observations at the end of 2024 than there were observations without a listed method code. Did the organization collecting these data points experience increasing difficulty in procuring their observations? Did their database encounter an issue that caused the system to have trouble recording the collection method used? Why would the number of observations without a listed method code increase over the year?

As mentioned previously, these distributions don't necessarily have any direct implications, but they can be helpful to have in mind for subsequent analyses. Moreover, given that this section is interested in ozone readings & the fact that there are about 2,700 data points without a listed ozone reading, it could be worth investigating if certain method codes have more missing ozone readings than others. Understanding this could help indicate whether the 2,700 observations with missing ozone values can be dropped from the analysis or not.

In [46]:
#Determine the proportion of observations that are missing ozone values per 'Method Code'
num_obs_method = data_III_B['Method Code'].value_counts(dropna=False).reset_index().rename(
                    columns={'count':'# total obs.'}).merge(
        data_III_B[data_III_B['Daily Max 8-hour Ozone Concentration'].isna()]['Method Code'].value_counts(
            dropna=False).reset_index().rename(columns={'count':'# missing ozone values'}) , on='Method Code')
num_obs_method['Prop missing Ozone values (%)'] = 100*(num_obs_method['# missing ozone values']/num_obs_method['# total obs.'])
num_obs_method
Out[46]:
  Method Code  # total obs.  # missing ozone values  Prop missing Ozone values (%)
0        87.0         34890                    1756                       5.032961
1        47.0          8438                     435                       5.155250
2     Missing          6312                     298                       4.721166
3       199.0          4001                     210                       5.248688
4        53.0           386                      19                       4.922280

For each of the five categories of method codes, the proportion of observations that do not have a listed ozone reading is between about 4.7 & 5.3 percent. Given that these proportions are rather comparable & not starkly inconsistent, it seems reasonable to conclude that there is no meaningful relationship between the missing 'Daily Max 8-hour Ozone Concentration' values & the missing 'Method Code' values. Furthermore, it is reasonable to drop the observations with missing ozone data because of how few data points they account for across each collection method.
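This eyeball judgment could be backed by a chi-square test of independence between method code & missingness, using the counts from the table above (a sketch assuming scipy is available; not part of the original notebook):

```python
from scipy.stats import chi2_contingency

# Rows: method codes 87, 47, Missing, 199, 53 (counts copied from the table above)
# Columns: [# non-missing ozone values, # missing ozone values]
table = [
    [34890 - 1756, 1756],
    [8438 - 435, 435],
    [6312 - 298, 298],
    [4001 - 210, 210],
    [386 - 19, 19],
]

# A large p-value means no evidence that missingness depends on method code
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3f}")  # p well above 0.05 here
```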

Next, direct analyses on the 'Method Code' variable in relation to ozone variables can be performed. First, aggregate values of ozone levels across different method codes can be obtained prior to undergoing time-series analyses.

In [47]:
#Remove observations with missing ozone data; 54,027 -> 51,309 rows
data_III_C = data_III_B.dropna(subset=['Daily Max 8-hour Ozone Concentration'])

#Obtain summary statistics of ozone metrics per 'Method Code'
oz_method_stats = data_III_C.groupby('Method Code')['Daily Max 8-hour Ozone Concentration'].describe().reset_index()

#Visualize ozone levels per 'Method Code'
ax1 = sns.boxplot(data=data_III_C, x='Method Code', y='Daily Max 8-hour Ozone Concentration')
#add the avg ozone level across the whole dataset
ax1.axhline(y=clean_df.dropna(subset=['Daily Max 8-hour Ozone Concentration'])['Daily Max 8-hour Ozone Concentration'].mean(),
           color='green', linestyle='--', alpha=0.8, label='Overall Avg')
#add the ozone level the EPA deems "unhealthy" (0.07 ppm)
ax1.axhline(y=0.07, color='red', linestyle=':', alpha=0.8, label='Unhealthy Levels of Ozone')
#superpose avg ozone level per 'Method Code'
sns.stripplot(data=oz_method_stats, x='Method Code', y='mean', color='orange', alpha=0.9, jitter=0, marker='*', size=8,
              ax=ax1, label='Method Avg')

#edit plot details
plt.title('Ozone Concentrations per Collection Method (California 2024)', fontsize=11)
plt.xticks(fontsize=9)
plt.ylabel('Daily Max 8-hour Ozone Concentration (ppm)')
plt.yticks(fontsize=8.5)
leg_temp = ax1.get_legend_handles_labels()
plt.legend(handles=leg_temp[0][:3], labels=leg_temp[1][:3], fontsize=9)
plt.grid(axis='y',alpha=0.2,color='gray')
plt.tight_layout();

The figure above illustrates distributions of daily ozone values across four unique collection methods (plus observations whose 'Method Code' was not listed) in California in 2024. The green, dashed, horizontal line indicates the average ozone value across the entire dataset, & the orange stars indicate the average ozone value for each 'Method Code.' The red, dotted, horizontal line indicates the threshold at which concentrations of ozone become unhealthy & potentially dangerous according to the EPA. Note that the vertical axis is in units of parts per million (ppm).
Recall that the "53" collection method only had 367 observations with ozone readings whereas the "87" method had more than 33,000. The group of observations with a "Missing" method code was included in this analysis because of how many data points it accounted for (about 6,300 or 12% of the dataset).

The "53" method code had an average ozone reading about 0.015 ppm higher than any of the other three codes & the "Missing" group. While this could be a consequence of its smaller number of data points, the ozone levels associated with this method code are generally greater than those of any of the other three codes. On the other hand, its most extreme ozone values, in terms of maximums, were considerably smaller than those of the two most popular method codes ("47" & "87").
Observations without a listed 'Method Code' generally had ozone values considerably smaller than those of any of the four method codes.

Of the four method codes, three of them have an average daily ozone reading that is greater than the aggregate average across the whole dataset.
Generally, the more data points that a method code has, the wider the distribution of ozone values. The most popular code, "87," spans ozone values between zero & 0.140 ppm, whereas the least popular code, "53," spans ozone values between about 0.015-0.100 ppm.

Ultimately, the "53" collection method recorded considerably higher ozone readings than any other, whereas the "47" method had the smallest ozone values on average. The ozone values associated with the method codes "199" & "87" were largely indistinguishable, but they were typically greater than the average ozone value across the entire dataset. Ozone values for observations with an undefined 'Method Code' were noticeably lower than those of any of the four collection method codes.

Next, a similar process that also includes time can be performed.

In [48]:
#Obtain ozone statistics per 'Method Code'-Month pair
oz_mth_method_stats = data_III_B.groupby([data_III_B['New Date'].dt.month, 
        'Method Code'])['Daily Max 8-hour Ozone Concentration'].describe().reset_index().rename(columns={'New Date':'Month'})
#Merge the data & method-month template to get all method-month pairs
oz_mth_method_stats = meth_mth_template.merge(oz_mth_method_stats, on=['Month','Method Code'], how='outer')
#Add a month name variable
oz_mth_method_stats['Month Name'] = pd.to_datetime(oz_mth_method_stats['Month'], format='%m').dt.month_name()


#Visualize avg ozone levels per Month, Method Code
ax1 = sns.lineplot(data=oz_mth_method_stats, x='Month Name', y='mean', hue='Method Code', marker='.')
#add the avg ozone level across the entire dataset
ax1.axhline(y=clean_df.dropna(subset=['Daily Max 8-hour Ozone Concentration'])['Daily Max 8-hour Ozone Concentration'].mean(),
           color='black', linestyle='--', alpha=0.4, label='Overall Avg')
#add the ozone level the EPA deems "unhealthy" (0.07 ppm)
ax1.axhline(y=0.07, color='red', linestyle=':', alpha=0.8, label='Unhealthy Levels of Ozone')

#edit plot details
plt.title('Monthly Average Ozone Concentrations per Method Code (California 2024)')
plt.xlabel('Month')
plt.xticks(fontsize=8, rotation=20)
plt.ylabel('Avg Daily Ozone Level (ppm)')
plt.yticks(fontsize=9)
ax1.set(ylim=(0,0.083))
plt.legend(title='Method Code', title_fontsize=9, fontsize=8, loc='lower center', ncols=2)
plt.grid(axis='y',alpha=0.2,color='gray')
plt.tight_layout();

The figure above displays monthly average ozone concentrations across the four collection method codes as well as the group of observations with a "Missing" code. The dark, dashed, horizontal line indicates the average ozone level observed across the entire state of California in 2024. The red, dotted, horizontal line indicates the threshold at which concentrations of ozone become unhealthy & potentially dangerous according to the EPA. Recall that these five groups of method codes are built on vastly different sizes of data; the "53" code has fewer than 400 observations, whereas the "87" code has over 30,000.
Additionally, each of the five method-code groups has data points in every month except for the "53" code, which doesn't have ozone data in the first three months of 2024.

Throughout this report, a general trend has emerged in how average ozone levels evolved on a monthly basis throughout 2024 in California. In the beginning of the year, ozone levels were rather minimal, but they rose steadily for the next several months before peaking in early summer (i.e. July). From there, ozone levels declined through the end of the year. Generally, this trend is replicated in the figure above, but there are some inconsistencies, the most apparent of which are discussed below.

  • The method code "47" as well as the observations with a "Missing" code exhibit ozone levels that oppose this trend in the late spring & early summer. More specifically, the average ozone levels of these two groups generally declined from May through July & then increased gradually from August through October.
  • When excluding the "Missing" method code group, the average monthly ozone level across the four defined methods was greater than the yearly, statewide average from April through October. In the first three & the last two months of 2024, their monthly averages were all below this comprehensive average.

Whether the different collection methods show disparities in this plot depends on the time of year. Based on the general trend of average ozone levels throughout California in 2024, the greater the average ozone concentration, the greater the disparity in ozone levels between these five groups appears to be. The following notes reinforce this.

  • From January through March as well as November through December--when ozone levels were at their lowest in California in 2024--the average ozone readings for each of the five method codes were quite similar. Ozone levels across the five groups were the most similar in February. When excluding the "Missing" method code group, the average monthly ozone levels of the four distinct method codes were also fairly similar in April & May.
  • From May through October, the disparities in average ozone levels across the five groups were most significant, particularly in July when ozone levels generally peaked across the state in 2024.

In summary, ozone levels were generally comparable regardless of the collection method code between January through April as well as November & December. Between May & October, however, ozone levels were markedly disparate especially in July.

Analysis III-B Conclusions¶

Do different methods report different ozone levels?

In the dataset, there are four defined collection methods, named using numerical codes; however, there were about 6,300 data points whose method code was missing. Instead of ignoring or removing them in this section, they were compiled into a single group with a "Missing" method code, so five groups were analyzed in total in Analysis III-B.
Two general subanalyses were performed, evaluating the distributions & averages of ozone levels across these five method groups both for 2024 as a whole & on a monthly basis.

It is important to point out that the four defined method codes have very different numbers of associated data points: the "53" code had fewer than 400, whereas the "87" code had more than 30,000. While these disparities weren't immediately consequential, they could potentially be a factor in any differences found in ozone levels across the collection methods.

In California in 2024, the "53" collection method recorded considerably higher ozone readings than any other (~0.060 ppm), whereas the "47" method had the smallest ozone values on average (~0.042 ppm). The ozone values associated with the method codes "199" & "87" were nearly indistinguishable from one another (~0.045 ppm), but they were typically greater than the average ozone value across the entire dataset (~0.044 ppm). Ozone values for observations with an undefined 'Method Code' were noticeably lower than those of any of the four collection method codes (~0.036 ppm).

Across the year, the season appeared to have an impact on the differences in ozone levels across the five groups of method codes. Generally, the differences in average ozone levels were greater when ozone concentrations were more intense.

  • From January through March as well as November through December--when ozone levels were at their lowest in California in 2024--the average ozone readings for each of the five method codes were quite similar. Ozone levels across the five groups were the most similar in February. When excluding the "Missing" method code group, the average monthly ozone levels of the four distinct method codes were also fairly similar in April & May.
    • For example, in February, the averages across the five groups all fell between about 0.033 & 0.037 ppm.
  • From May through October, the disparities in average ozone levels across the five groups were most significant, particularly in July when ozone levels generally peaked across the state in 2024.
    • In July, the averages across the five groups ranged between about 0.036 & 0.079 ppm.

Analysis IV¶

Consider if urban activity (weekend vs. weekday) has any effect on ozone levels across different days.

Urban activity typically differs based on the day of the week as well as the time of day. For example, weekday activity is most likely to peak early & late in the day during "rush hours," when people are commuting between their residences & workplaces. On weekends, when a large part of the general population is not working, urban activity from commuting is typically much less predictable. While more people may be likely to stay home to relax & recharge from the week, they also have more time to go out & participate in activities that they wouldn't necessarily have time for during the week, like going to a concert, taking a short trip, or shopping. Furthermore, such weekend activities can also differ depending on the time of year, due to warm or cold temperatures & whether people are in school.
Ultimately, while urban activities usually differ during the week versus the weekend, their contribution to ozone quantities may be more similar than expected.

Recall the yearly trend of ozone levels in California in 2024 that was previously analyzed. Typically, ozone levels across the state were higher in the late spring & early summer whereas they were lower in the early winter. While this may not be immediately relevant in this section, it is useful to keep in mind that ozone levels on a daily basis may change depending on what time of year is being analyzed. Of course, the analysis could be expanded to evaluate a certain time of 2024 (i.e. a specific month or season), but the general analysis will encompass the entire year.

On another note, this section analyzes ozone levels across the entire state of California. This could be augmented with a location-based analysis, such as per county. Alternatively, given the number of counties in the state, an individual county could be examined to evaluate how ozone levels differed day by day.

In order to compare ozone levels between weekdays & weekends, the day of the week for each date needs to be obtained, which can be done with the 'New Date' variable. Moreover, given that there are five days during the week & only two on the weekend, some metrics, like a sum, wouldn't make much sense to use. Instead, standardized metrics, like the mean, are more appropriate because they inherently normalize the disproportionate group sizes. Fortunately, each day has a considerable & fairly similar number of associated observations, so any worries regarding unrepresentative data points can be relieved.
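The sum-vs-mean point can be seen with a toy example (hypothetical numbers, not drawn from the dataset): with identical readings, a sum is inflated for weekdays purely because there are five of them, while the mean is unaffected by the imbalance.

```python
import pandas as pd

# Toy readings (hypothetical, not from the dataset): five weekday observations
# vs two weekend observations, all at the same ozone level (ppm).
readings = pd.DataFrame({
    'group': ['Week'] * 5 + ['Weekend'] * 2,
    'ozone': [0.044] * 7,
})

# The sum is 2.5x larger for weekdays purely because of their larger count...
sums = readings.groupby('group')['ozone'].sum()
# ...whereas the mean is identical for both groups despite the 5-vs-2 imbalance.
means = readings.groupby('group')['ozone'].mean()
print(sums)
print(means)
```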

To begin, daily averages of ozone levels can be obtained as well as statistics during the week versus the weekend.

In [49]:
#Filter for the relevant variables (use .copy() so the column assignments below
#don't trigger a SettingWithCopyWarning on a slice of clean_df)
data_IV = clean_df[['New Date','Daily Max 8-hour Ozone Concentration','County FIPS Code','County','Site Latitude',
                'Site Longitude']].copy()
#Remove observations with missing ozone data
data_IV = data_IV.dropna(subset=['Daily Max 8-hour Ozone Concentration'])

#Create a variable denoting the day of the week (DoW)
data_IV['DoW'] = data_IV['New Date'].dt.day_name()
#Order the days by converting the variable to a category
data_IV['DoW'] = pd.Categorical(data_IV['DoW'], ordered=True, categories=['Monday','Tuesday','Wednesday','Thursday','Friday',
                                                                   'Saturday','Sunday'])

#Create a variable denoting if the day is during the week or weekend
data_IV['Week/Weekend'] = data_IV['DoW'].apply(lambda x: 'Weekend' if x in ['Saturday','Sunday'] else 'Week')
#Order the values by converting the variable to a category
data_IV['Week/Weekend'] = pd.Categorical(data_IV['Week/Weekend'], ordered=True, categories=['Week','Weekend'])

#data_IV.head(2)
In [50]:
#Get an initial understanding of ozone values per DoW & week/weekend
fig, axes = plt.subplots(ncols=2, figsize=(10,4), sharey=True)

#DoW
ax1 = sns.boxplot(data=data_IV, x='DoW', y='Daily Max 8-hour Ozone Concentration', ax=axes[0])
#Week/Weekend
ax1 = sns.boxplot(data=data_IV, x='Week/Weekend', y='Daily Max 8-hour Ozone Concentration', ax=axes[1])


#edit plot details
for k, v in {'Ozone Levels per Day (California 2024)':[0,'Day of Week'], 
             'Ozone Levels on Week/Weekend (California 2024)':[1,'Week / Weekend']}.items():
    plt.sca(axes[v[0]])
    plt.title(k, fontsize=12)
    plt.xlabel(v[1], fontsize=9.5)
    plt.xticks(fontsize=8)
    plt.ylabel('Daily Max 8-Hour Ozone Concentration (ppm)', fontsize=9.5)
    plt.yticks(ticks=np.arange(0,0.15,0.01), fontsize=7.25)
    plt.grid(axis='y',alpha=0.2,color='gray')
    plt.tight_layout();

The figures above display distributions of ozone levels across all of California in 2024. The plot on the left illustrates these levels by day, whereas the plot on the right distinguishes solely between weekdays & weekends. Note that the vertical axes are in units of parts per million (ppm). Recall that the average ozone level across the state in 2024 was about 0.044 ppm.

Generally, there are no major discernible differences between these figures. Some daily distributions are a bit wider or sit a bit higher than others. The one notable disparity between categories in these plots is that the largest ozone values observed on the weekend were considerably smaller than those observed during the week.

Otherwise, these figures alone offer no concrete evidence that ozone levels meaningfully differed by day or between the week & weekend. With distributions this similar, it is more useful to look at numerical summary statistics than at a figure.

In [51]:
#Obtain ozone statistics per DoW, week/weekend
display("Day-of-week statistics:")
display(data_IV.groupby('DoW')['Daily Max 8-hour Ozone Concentration'].describe())
display("-------------------------------------------------------------------------------------------------------------------")
display("Week/Weekend statistics:")
display(data_IV.groupby('Week/Weekend')['Daily Max 8-hour Ozone Concentration'].describe())
'Day-of-week statistics:'
count mean std min 25% 50% 75% max
DoW
Monday 7464.0 0.042771 0.014185 0.0 0.033 0.041 0.051 0.113
Tuesday 7289.0 0.043706 0.014778 0.0 0.034 0.042 0.052 0.121
Wednesday 7205.0 0.044169 0.015226 0.0 0.034 0.042 0.053 0.118
Thursday 7269.0 0.044005 0.014898 0.0 0.034 0.042 0.052 0.139
Friday 7327.0 0.043714 0.015180 0.0 0.034 0.041 0.052 0.131
Saturday 7382.0 0.043430 0.014737 0.0 0.034 0.041 0.051 0.114
Sunday 7373.0 0.043046 0.013210 0.0 0.035 0.042 0.050 0.099
'-------------------------------------------------------------------------------------------------------------------'
'Week/Weekend statistics:'
count mean std min 25% 50% 75% max
Week/Weekend
Week 36554.0 0.043667 0.014862 0.0 0.034 0.041 0.052 0.139
Weekend 14755.0 0.043238 0.013996 0.0 0.034 0.041 0.051 0.114

With these summary statistics, it becomes easier to discern how ozone levels differ between days. Moreover, this form of the data accentuates how small & insignificant the general differences in ozone levels are.

On a daily basis:

  • Ozone levels across California in 2024 were the most severe on Wednesdays & the least severe on Mondays; however, the difference between these two averages is only about 0.0014 ppm, which is quite insignificant. The median values across the seven days of the week fall within a range of about 0.001 ppm.
  • Once more, the major disparity of note pertains to the maximum values. The largest ozone values observed on Saturday & Sunday were the third smallest & the smallest of the seven daily maxima, respectively (Monday saw the second smallest maximum). Also, Sunday was the only day whose maximum value was less than 0.100 ppm. Furthermore, Sundays exhibited a noticeably smaller variance in ozone levels, indicating that ozone concentrations were more stable on Sundays than on any other day.

When comparing weekdays & weekends:

  • The average ozone level during the week was about 0.0004 ppm higher than it was on the weekends. They had the same median value.
  • Ozone levels on weekdays had a standard deviation about 0.0009 ppm greater than on weekends. This is likely because the maximum ozone value observed during the week was about 0.025 ppm greater than the weekend maximum.

Ultimately, while ozone levels were generally slightly smaller on the weekend than during the week, particularly on Sunday, the differences between the two groups are too meager to credibly state that they are meaningful. As such, it can be stated that urban activity did not have a considerable effect on ozone levels between the week & weekend.
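The "too meager to be meaningful" judgment above could be supplemented with a formal two-sample test, which was not part of the original analysis. A hedged sketch using Welch's t-statistic (no equal-variance assumption), with synthetic samples generated from the means, standard deviations, & sample sizes in the tables above standing in for the two real ozone columns:

```python
import math
import numpy as np

# Toy stand-ins for the weekday & weekend ozone samples (ppm), drawn using the
# means, standard deviations, & sample sizes reported in the tables above; the
# real analysis would use the two actual ozone columns instead.
rng = np.random.default_rng(0)
week = rng.normal(loc=0.043667, scale=0.014862, size=36554)
weekend = rng.normal(loc=0.043238, scale=0.013996, size=14755)

# Welch's t-statistic: compares two means without assuming equal variances.
se = math.sqrt(week.var(ddof=1) / len(week) + weekend.var(ddof=1) / len(weekend))
t_stat = (week.mean() - weekend.mean()) / se

# With tens of thousands of samples, the t-distribution is effectively normal,
# so a two-sided p-value can be approximated with the normal CDF.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(t_stat) / math.sqrt(2))))
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A caveat worth noting: with samples this large, even a ~0.0004 ppm gap can register as statistically significant while remaining practically negligible, which is why the report leans on effect size rather than on significance alone.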

Next, the analysis was expanded to evaluate the ozone levels between weekdays & weekends on a monthly level throughout 2024.

In [52]:
#Obtain ozone statistics per month for weeks/weekends
wk_wknd_oz_stats = data_IV.groupby([data_IV['New Date'].dt.month,'Week/Weekend'])[
                    'Daily Max 8-hour Ozone Concentration'].describe().reset_index().rename(columns={'New Date':'Month'})
#Create a variable denoting the name of the month
wk_wknd_oz_stats['Month Name'] = pd.to_datetime(wk_wknd_oz_stats['Month'], format='%m').dt.month_name()


#Visualize the data
ax1 = sns.lineplot(data=wk_wknd_oz_stats, x='Month Name', y='mean', hue='Week/Weekend', style='Week/Weekend', marker='o', 
                  alpha=0.8)

#edit plot details
plt.title('Average Monthly Ozone Levels On Weekdays & Weekends (California 2024)', fontsize=11)
plt.xlabel('Month')
plt.xticks(fontsize=8, rotation=20)
plt.ylabel('Avg Ozone Concentration (ppm)')
plt.yticks(ticks=np.arange(0,0.061,0.005), fontsize=8)
ax1.set(ylim=(0,0.060))
plt.legend(title='Week/Weekend', title_fontsize=9, fontsize=8)
plt.grid(axis='y',alpha=0.2,color='gray')
plt.tight_layout();

The line plot above illustrates the average monthly ozone levels on weekdays & weekends throughout 2024 in California. Note that the vertical axis is in units of parts per million (ppm).

Recall the general trend of average monthly ozone levels throughout California in 2024: ozone levels were at their lowest in January & December, whereas they peaked in July. The weekday monthly data replicates this trend quite well. The weekend monthly data is less consistent: it peaks in September instead of July, & it doesn't rise & fall smoothly across consecutive months between the peak & the lowest value.
Comparing the monthly data points between the two groups, their average ozone values were very similar in the first four & last three months of 2024, with the weekend values slightly higher in each of these seven months. In the five intermediate months of the year, the disparities are more sizable, though still quite small--the largest difference is about 0.004 ppm. In four of these five months, weekdays exhibited greater average ozone levels than weekends.
So, weekends exhibited a greater average ozone level in eight of the 12 months of 2024, but when monthly ozone averages were at their greatest, ozone levels during the week were higher than on weekends.
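The eight-of-twelve tally above can be reproduced mechanically by pivoting the monthly statistics so each month has one column per group & comparing the two. A minimal sketch, assuming a frame shaped like the `wk_wknd_oz_stats` built earlier (toy data for three months stands in here):

```python
import pandas as pd

# Toy stand-in for the wk_wknd_oz_stats frame built earlier: one mean ozone
# value (ppm) per (month, week/weekend) pair, here for three months only.
wk_wknd_oz_stats = pd.DataFrame({
    'Month': [1, 1, 2, 2, 3, 3],
    'Week/Weekend': ['Week', 'Weekend'] * 3,
    'mean': [0.030, 0.031, 0.033, 0.034, 0.040, 0.038],
})

# Pivot so each month is a row with a 'Week' & a 'Weekend' column, then count
# the months in which the weekend mean exceeded the weekday mean.
pivot = wk_wknd_oz_stats.pivot(index='Month', columns='Week/Weekend', values='mean')
months_weekend_higher = int((pivot['Weekend'] > pivot['Week']).sum())
print(months_weekend_higher)  # 2 of the 3 toy months
```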

Once more, while there were some dissimilarities in average ozone levels between weekdays & weekends on a monthly basis throughout 2024, the magnitudes of these disparities are still not considerable enough to credibly state that urban activity had a considerable effect on ozone levels between the week & weekend.

This analysis could be expanded to evaluate the average monthly ozone levels on a daily basis. More specifically, were ozone levels consistent &/or meaningfully higher on certain days of the week than others?
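Such an expansion could start from a month-by-day-of-week pivot table. A hedged sketch, with one synthetic reading per day of 2024 standing in for the real `data_IV` rows:

```python
import numpy as np
import pandas as pd

# Toy daily series standing in for data_IV: one synthetic ozone reading (ppm)
# per day of 2024.
rng = np.random.default_rng(1)
dates = pd.date_range('2024-01-01', '2024-12-31', freq='D')
toy = pd.DataFrame({
    'New Date': dates,
    'Daily Max 8-hour Ozone Concentration': rng.normal(0.044, 0.015, len(dates)).clip(0),
})
toy['DoW'] = pd.Categorical(toy['New Date'].dt.day_name(), ordered=True,
                            categories=['Monday','Tuesday','Wednesday','Thursday',
                                        'Friday','Saturday','Sunday'])

# Mean ozone per (month, day of week): 12 rows (months) x 7 columns (days),
# from which consistently high or low days would stand out month by month.
monthly_dow = toy.pivot_table(index=toy['New Date'].dt.month, columns='DoW',
                              values='Daily Max 8-hour Ozone Concentration',
                              aggfunc='mean', observed=False)
print(monthly_dow.shape)  # (12, 7)
```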

Next, data pertaining to Los Angeles (LA) County in California was examined in isolation to search further for potential patterns in ozone levels between weekdays & weekends.

In [53]:
#Filter the data for observations in the LA county -- 4,055 data points (w/ozone data)
data_IV_LA = data_IV[data_IV.County == 'Los Angeles']

#Obtain statistics of ozone data per day, week/weekend
LA_mth_wk_wknd_oz_stats = data_IV_LA.groupby([data_IV_LA['New Date'].dt.month,'Week/Weekend'])[
                    'Daily Max 8-hour Ozone Concentration'].describe().reset_index().rename(columns={'New Date':'Month'})
#Create a variable indicating the month name
LA_mth_wk_wknd_oz_stats['Month Name'] = pd.to_datetime(LA_mth_wk_wknd_oz_stats['Month'], format='%m').dt.month_name()


#Visualize the data
ax1 = sns.lineplot(data=LA_mth_wk_wknd_oz_stats, x='Month Name', y='mean', hue='Week/Weekend', alpha=0.8, marker='o', 
                  style='Week/Weekend')

#edit plot details
plt.title('Average Monthly Ozone Levels On Weekdays & Weekends (Los Angeles 2024)', fontsize=11)
plt.xlabel('Month')
plt.xticks(fontsize=8, rotation=20)
plt.ylabel('Avg Ozone Concentration (ppm)')
plt.yticks(ticks=np.arange(0,0.071,0.005), fontsize=8)
ax1.set(ylim=(0,0.070))
plt.legend(title='Week/Weekend', title_fontsize=9, fontsize=8, loc='lower center')
plt.grid(axis='y',alpha=0.2,color='gray')
plt.tight_layout();

The line plot above illustrates the average monthly ozone levels on weekdays & weekends throughout 2024 in Los Angeles, California. Note that the vertical axis is in units of parts per million (ppm).

Similar to the California-wide monthly averages, the weekday & weekend monthly ozone levels are quite similar for most of the year. There are only two months in which these averages differ considerably--July & September--& the largest of these monthly disparities is about 0.009 ppm, which is still relatively small.
Nonetheless, the average monthly ozone levels on the weekend in Los Angeles were greater than those during the week in eight of the 12 months in 2024.

Once more, while there were some dissimilarities in average ozone levels between weekdays & weekends on a monthly basis throughout 2024, the magnitudes of these disparities are still not considerable enough to credibly state that urban activity had a considerable effect on ozone levels between the week & weekend in Los Angeles, California.

Next, the distributions of ozone levels for each day of the week in 2024 across Los Angeles were analyzed.

In [54]:
#Get ozone statistics per DoW
display(data_IV_LA.groupby('DoW')['Daily Max 8-hour Ozone Concentration'].describe())

#Visualize ozone data per DoW
ax1 = sns.boxplot(data=data_IV_LA, x='DoW', y='Daily Max 8-hour Ozone Concentration')
#edit plot details
plt.title('Ozone Levels per Day (Los Angeles 2024)')
plt.xlabel('Day')
plt.xticks(fontsize=8.5)
plt.ylabel('Daily Max 8-hour Ozone Concentration (ppm)')
plt.yticks(fontsize=8, ticks=np.arange(0,0.15,0.02))
plt.grid(axis='y',alpha=0.2,color='gray')
plt.tight_layout();
count mean std min 25% 50% 75% max
DoW
Monday 592.0 0.047047 0.014913 0.007 0.03775 0.0445 0.055 0.100
Tuesday 587.0 0.047525 0.015708 0.008 0.03700 0.0460 0.058 0.111
Wednesday 575.0 0.048663 0.016426 0.012 0.03800 0.0460 0.058 0.114
Thursday 572.0 0.048731 0.016203 0.012 0.03800 0.0445 0.059 0.112
Friday 574.0 0.048533 0.017300 0.015 0.03700 0.0440 0.058 0.108
Saturday 589.0 0.049031 0.016194 0.016 0.03900 0.0460 0.058 0.114
Sunday 566.0 0.049415 0.013897 0.014 0.04000 0.0460 0.058 0.094

The figures above display distributions of ozone levels across the seven days of the week in Los Angeles, California in 2024. Note that the vertical axes are in units of parts per million (ppm). Recall that the average ozone level across the state in 2024 was about 0.044 ppm.

In comparison to the statewide ozone levels, differences between them on a daily basis in Los Angeles are slightly more discernible but they're still minimal. From the ranges of the boxes, it appears that ozone levels were the most volatile on Tuesdays & Thursdays, whereas they were the most stable on Mondays & Sundays. The median lines indicate that ozone levels were typically slightly higher on Tuesdays, Wednesdays, & the weekends, whereas they were lower on Mondays & Fridays.
From the statistics, the averages indicate that ozone levels were slightly greater on Saturdays & Sundays than on any weekday, though the largest such gap is only about 0.0024 ppm, which is quite insignificant. On average, ozone levels were lowest on Mondays.

Ultimately, day-by-day ozone levels in Los Angeles in 2024 were slightly higher on average on Saturdays & Sundays, but the daily distributions are otherwise generally indistinguishable.

Analysis IV - Conclusions¶

Consider if urban activity (weekend vs. weekday) has any effect on ozone levels across different days.

Before analysis was conducted, it was theorized that ozone levels may differ between weekdays & weekends because different kinds & quantities of urban activity usually take place during the week & on the weekend. For example, a large chunk of the general population has to commute to & from their workplace between Monday & Friday, whereas they are more likely to stay home on the weekend.
By evaluating ozone levels between weekdays & weekends as well as on a daily basis, this theory was shown to be inconclusive.

First, ozone distributions were analyzed across each day of the week (Sunday through Saturday) & between weekdays (Monday through Friday) & weekends (Saturday & Sunday). This was done across the entire state of California in general & on a monthly basis.

  • Overall, ozone levels were slightly less severe in California on the weekend than during the week, particularly on Sunday; however, the differences between the two groups are too meager to credibly state that they are meaningful. As such, it can be stated that urban activity did not have a considerable effect on ozone levels between the week & weekend.
  • On a monthly basis, ozone levels were higher on weekends in eight of the 12 months, but the largest ozone observations occurred on weekdays. When ozone levels were higher in 2024 (late spring & early summer), they were generally greater during the week, but as ozone levels declined in other parts of the year, the average ozone levels on the weekend overtook those during the week.
    • Once more, while there were some dissimilarities in average ozone levels between weekdays & weekends on a monthly basis throughout 2024, the magnitudes of these disparities are still not considerable enough to credibly state that urban activity had a considerable effect on ozone levels between the week & weekend.

In addition, the county of Los Angeles was analyzed in isolation to determine if the ozone patterns were consistent with what was found previously.

  • Overall, day-by-day ozone levels were slightly higher on average on Saturdays & Sundays, but these differences are practically indistinguishable & insignificant.
  • On a monthly basis, the ozone observations were not as consistent with the statewide trends, particularly those on the weekend. Again, the average monthly ozone levels on the weekend in Los Angeles were greater than those on the weekdays in eight of the 12 months in 2024. Only two months exhibited relatively "larger" disparities in their average monthly ozone levels--July & September--but the largest of these monthly differences was only about 0.009 ppm.

In summary, there are some small differences in ozone levels on a daily basis in California throughout 2024; however, the magnitudes of these disparities are too minute to credibly state that urban activity had a considerable effect on ozone levels between the week & weekend. This was found to be the case throughout the whole state, on a monthly basis, & in the county of Los Angeles.


Sources¶

  • Cover image
  • EPA (dataset source)
  • AirNow
  • Wikipedia - California CBSA's
  • California State Association of Counties
  • EPA - Air Pollutants
  • GitHub - CA Counties